Search | arXiv e-print repository

arXiv:2406.01938 [pdf, other]

Nutrition Estimation for Dietary Management: A Transformer Approach with Depth Sensing

Authors: Zhengyi Kwan, Wei Zhang, Zhengkui Wang, Aik Beng Ng, Simon See

Abstract: Nutrition estimation is crucial for effective dietary management and overall health and well-being. Existing methods often struggle with sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network designed for nutrition estimation that utilizes both RGB and depth information from food images. We have designed and implemented a multi-scale encoder an… ▽ More Nutrition estimation is crucial for effective dietary management and overall health and well-being. Existing methods often struggle with sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network designed for nutrition estimation that utilizes both RGB and depth information from food images. We have designed and implemented a multi-scale encoder and decoder, along with two types of feature fusion modules, specialized for estimating five nutritional factors. These modules effectively balance the efficiency and effectiveness of feature extraction with flexible usage of our customized attention mechanisms and fusion strategies. Our experimental study shows that NuNet outperforms its variants and existing solutions significantly for nutrition estimation. It achieves an error rate of 15.65%, the lowest known to us, largely due to our multi-scale architecture and fusion modules. This research holds practical values for dietary management with huge potential for transnational research and deployment and could inspire other applications involving multiple data types with varying degrees of importance. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 10 pages

arXiv:2405.13629 [pdf, other]

Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow

Authors: Chen-Hao Chao, Chien Feng, Wei-Fang Sun, Cheng-Kuang Lee, Simon See, Chun-Yi Lee

Abstract: Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy improvement. In the policy evaluation steps, the critic is updated to capture the soft Q-function. In the policy improvement steps, the actor is adjusted in accordance wit… ▽ More Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy improvement. In the policy evaluation steps, the critic is updated to capture the soft Q-function. In the policy improvement steps, the actor is adjusted in accordance with the updated soft Q-function. In this paper, we introduce a new MaxEnt RL framework modeled using Energy-Based Normalizing Flows (EBFlow). This framework integrates the policy evaluation steps and the policy improvement steps, resulting in a single objective training process. Our method enables the calculation of the soft value function used in the policy evaluation target without Monte Carlo approximation. Moreover, this design supports the modeling of multi-modal action distributions while facilitating efficient action sampling. To evaluate the performance of our method, we conducted experiments on the MuJoCo benchmark suite and a number of high-dimensional robotic tasks simulated by Omniverse Isaac Gym. The evaluation results demonstrate that our method achieves superior performance compared to widely-adopted representative baselines. △ Less

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.02630 [pdf, other]

cuTN-QSVM: cuTensorNet-accelerated Quantum Support Vector Machine with cuQuantum SDK

Authors: Kuan-Cheng Chen, Tai-Yue Li, Yun-Yuan Wang, Simon See, Chun-Chieh Wang, Robert Wille, Nan-Yow Chen, An-Cheng Yang, Chun-Yu Lin

Abstract: This paper investigates the application of Quantum Support Vector Machines (QSVMs) with an emphasis on the computational advancements enabled by NVIDIA's cuQuantum SDK, especially leveraging the cuTensorNet library. We present a simulation workflow that substantially diminishes computational overhead, as evidenced by our experiments, from exponential to quadratic cost. While state vector simulatio… ▽ More This paper investigates the application of Quantum Support Vector Machines (QSVMs) with an emphasis on the computational advancements enabled by NVIDIA's cuQuantum SDK, especially leveraging the cuTensorNet library. We present a simulation workflow that substantially diminishes computational overhead, as evidenced by our experiments, from exponential to quadratic cost. While state vector simulations become infeasible for qubit counts over 50, our evaluation demonstrates that cuTensorNet speeds up simulations to be completed within seconds on the NVIDIA A100 GPU, even for qubit counts approaching 784. By employing multi-GPU processing with Message Passing Interface (MPI), we document a marked decrease in computation times, effectively demonstrating the strong linear speedup of our approach for increasing data sizes. This enables QSVMs to operate efficiently on High-Performance Computing (HPC) systems, thereby opening a new window for researchers to explore complex quantum algorithms that have not yet been investigated. In accuracy assessments, our QSVM achieves up to 95\% on challenging classifications within the MNIST dataset for training sets larger than 100 instances, surpassing the capabilities of classical SVMs. These advancements position cuTensorNet within the cuQuantum SDK as a pivotal tool for scaling quantum machine learning simulations and potentially signpost the seamless integration of such computational strategies as pivotal within the Quantum-HPC ecosystem. △ Less

Submitted 8 May, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

Comments: 10 pages, 14 figures

arXiv:2402.10646 [pdf, other]

AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation

Authors: Zhaowei Wang, Wei Fan, Qing Zong, Hongming Zhang, Sehyun Choi, Tianqing Fang, Xin Liu, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstract ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LL… ▽ More Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstract ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LLMs in capturing the underlying rationale of abstraction. Meanwhile, we introduce a plausibility estimator to select instructions that are more consistent with the abstraction knowledge of LLMs to be aligned. Then, our framework combines abstraction instructions with general-purpose ones to build a hybrid dataset. Extensive experiments and analyses demonstrate that our framework can considerably enhance LLMs' abstraction ability with strong generalization performance while maintaining their general instruction-following abilities. △ Less

Submitted 17 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted by ACL 2024

arXiv:2401.15977 [pdf, other]

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Authors: Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li

Abstract: We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video map**, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the ref… ▽ More We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video map**, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image's feature to synthesized frames with the guidance of predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even at the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V can support users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability of the I2V process than solely relying on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/. △ Less

Submitted 31 January, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Project page: https://xiaoyushi97.github.io/Motion-I2V/

arXiv:2401.14619 [pdf, other]

Resilient Practical Test-Time Adaptation: Soft Batch Normalization Alignment and Entropy-driven Memory Bank

Authors: Xingzhi Zhou, Zhiliang Tian, Ka Chun Cheung, Simon See, Nevin L. Zhang

Abstract: Test-time domain adaptation effectively adjusts the source domain model to accommodate unseen domain shifts in a target domain during inference. However, the model performance can be significantly impaired by continuous distribution changes in the target domain and non-independent and identically distributed (non-i.i.d.) test samples often encountered in practical scenarios. While existing memory… ▽ More Test-time domain adaptation effectively adjusts the source domain model to accommodate unseen domain shifts in a target domain during inference. However, the model performance can be significantly impaired by continuous distribution changes in the target domain and non-independent and identically distributed (non-i.i.d.) test samples often encountered in practical scenarios. While existing memory bank methodologies use memory to store samples and mitigate non-i.i.d. effects, they do not inherently prevent potential model degradation. To address this issue, we propose a resilient practical test-time adaptation (ResiTTA) method focused on parameter resilience and data quality. Specifically, we develop a resilient batch normalization with estimation on normalization statistics and soft alignments to mitigate overfitting and model degradation. We use an entropy-driven memory bank that accounts for timeliness, the persistence of over-confident samples, and sample uncertainty for high-quality data in adaptation. Our framework periodically adapts the source domain model using a teacher-student model through a self-training loss on the memory samples, incorporating soft alignment losses on batch normalization. We empirically validate ResiTTA across various benchmark datasets, demonstrating state-of-the-art performance. △ Less

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2310.05210 [pdf, other]

TILFA: A Unified Framework for Text, Image, and Layout Fusion in Argument Mining

Authors: Qing Zong, Zhaowei Wang, Baixuan Xu, Tianshi Zheng, Haochen Shi, Weiqi Wang, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argumen… ▽ More A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argument Mining), is designed to handle this mixed data. It excels at not only understanding text but also detecting optical characters and recognizing layout details in images. Our model significantly outperforms existing baselines, earning our team, KnowComp, the 1st place in the leaderboard of Argumentative Stance Classification subtask in this shared task. △ Less

Submitted 8 October, 2023; originally announced October 2023.

Comments: Accepted to the 10th Workshop on Argument Mining, co-located with EMNLP 2023

arXiv:2309.08303 [pdf, other]

Self-Consistent Narrative Prompts on Abductive Natural Language Inference

Authors: Chunkit Chan, Xin Liu, Tsz Ho Chan, Jiayang Cheng, Yangqiu Song, Ginny Wong, Simon See

Abstract: Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The abductive natural language inference ($α$NLI) task has been proposed, and this narrative text-based task aims to infer the most plausible hypothesis from the candidates given two observations. However, the inter-sentential coherence and the model consistency have not been well exploited… ▽ More Abduction has long been seen as crucial for narrative comprehension and reasoning about everyday situations. The abductive natural language inference ($α$NLI) task has been proposed, and this narrative text-based task aims to infer the most plausible hypothesis from the candidates given two observations. However, the inter-sentential coherence and the model consistency have not been well exploited in the previous works on this task. In this work, we propose a prompt tuning model $α$-PACE, which takes self-consistency and inter-sentential coherence into consideration. Besides, we propose a general self-consistent framework that considers various narrative sequences (e.g., linear narrative and reverse chronology) for guiding the pre-trained language model in understanding the narrative context of input. We conduct extensive experiments and thorough ablation studies to illustrate the necessity and effectiveness of $α$-PACE. The performance of our method shows significant improvement against extensive competitive baselines. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted at IJCNLP-AACL 2023 main track

arXiv:2308.05396 [pdf, other]

Learning Gabor Texture Features for Fine-Grained Recognition

Authors: Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, Jun Liu

Abstract: Extracting and using class-discriminative features is critical for fine-grained recognition. Existing works have demonstrated the possibility of applying deep CNNs to exploit features that distinguish similar classes. However, CNNs suffer from problems including frequency bias and loss of detailed local information, which restricts the performance of recognizing fine-grained categories. To address… ▽ More Extracting and using class-discriminative features is critical for fine-grained recognition. Existing works have demonstrated the possibility of applying deep CNNs to exploit features that distinguish similar classes. However, CNNs suffer from problems including frequency bias and loss of detailed local information, which restricts the performance of recognizing fine-grained categories. To address the challenge, we propose a novel texture branch as complimentary to the CNN branch for feature extraction. We innovatively utilize Gabor filters as a powerful extractor to exploit texture features, motivated by the capability of Gabor filters in effectively capturing multi-frequency features and detailed local information. We implement several designs to enhance the effectiveness of Gabor filters, including imposing constraints on parameter values and develo** a learning method to determine the optimal parameters. Moreover, we introduce a statistical feature extractor to utilize informative statistical information from the signals captured by Gabor filters, and a gate selection mechanism to enable efficient computation by only considering qualified regions as input for texture extraction. Through the integration of features from the Gabor-filter-based texture branch and CNN-based semantic branch, we achieve comprehensive information extraction. We demonstrate the efficacy of our method on multiple datasets, including CUB-200-2011, NA-bird, Stanford Dogs, and GTOS-mobile. State-of-the-art performance is achieved using our approach. △ Less

Submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV2023

arXiv:2308.00055 [pdf, other]

doi 10.1145/3639477.3639740

Towards Building AI-CPS with NVIDIA Isaac Sim: An Industrial Benchmark and Case Study for Robotics Manipulation

Authors: Zhehua Zhou, Jiayang Song, Xuan Xie, Zhan Shu, Lei Ma, Dikai Liu, Jianxiong Yin, Simon See

Abstract: As a representative cyber-physical system (CPS), robotic manipulator has been widely adopted in various academic research and industrial processes, indicating its potential to act as a universal interface between the cyber and the physical worlds. Recent studies in robotics manipulation have started employing artificial intelligence (AI) approaches as controllers to achieve better adaptability and… ▽ More As a representative cyber-physical system (CPS), robotic manipulator has been widely adopted in various academic research and industrial processes, indicating its potential to act as a universal interface between the cyber and the physical worlds. Recent studies in robotics manipulation have started employing artificial intelligence (AI) approaches as controllers to achieve better adaptability and performance. However, the inherent challenge of explaining AI components introduces uncertainty and unreliability to these AI-enabled robotics systems, necessitating a reliable development platform for system design and performance assessment. As a foundational step towards building reliable AI-enabled robotics systems, we propose a public industrial benchmark for robotics manipulation in this paper. It leverages NVIDIA Omniverse Isaac Sim as the simulation platform, encompassing eight representative manipulation tasks and multiple AI software controllers. An extensive evaluation is conducted to analyze the performance of AI controllers in solving robotics manipulation tasks, enabling a thorough understanding of their effectiveness. To further demonstrate the applicability of our benchmark, we develop a falsification framework that is compatible with physical simulators and OpenAI Gym environments. This framework bridges the gap between traditional testing methods and modern physics engine-based simulations. The effectiveness of different optimization methods in falsifying AI-enabled robotics manipulation with physical simulators is examined via a falsification test. Our work not only establishes a foundation for the design and development of AI-enabled robotics systems but also provides practical experience and guidance to practitioners in this field, promoting further research in this critical academic and industrial domain. △ Less

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2307.11526 [pdf, other]

CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields

Authors: Ziyuan Luo, Qing Guo, Ka Chun Cheung, Simon See, Renjie Wan

Abstract: Neural Radiance Fields (NeRF) have the potential to be a major representation of media. Since training a NeRF has never been an easy task, the protection of its model copyright should be a priority. In this paper, by analyzing the pros and cons of possible copyright protection solutions, we propose to protect the copyright of NeRF models by replacing the original color representation in NeRF with… ▽ More Neural Radiance Fields (NeRF) have the potential to be a major representation of media. Since training a NeRF has never been an easy task, the protection of its model copyright should be a priority. In this paper, by analyzing the pros and cons of possible copyright protection solutions, we propose to protect the copyright of NeRF models by replacing the original color representation in NeRF with a watermarked color representation. Then, a distortion-resistant rendering scheme is designed to guarantee robust message extraction in 2D renderings of NeRF. Our proposed method can directly protect the copyright of NeRF models while maintaining high rendering quality and bit accuracy when compared among optional solutions. △ Less

Submitted 29 July, 2023; v1 submitted 21 July, 2023; originally announced July 2023.

Comments: 11 pages, 6 figures, accepted by ICCV 2023 non-camera-ready version

arXiv:2307.06876 [pdf, other]

doi 10.1021/acsphotonics.4c00291

Light Emission and Conductance Fluctuations in Electrically Driven and Plasmonically Enhanced Molecular Junctions

Authors: Sakthi Priya Amirtharaj, Zhiyuan Xie, Josephine Si Yu See, Gabriele Rolleri, Wen Chen, Konstantin Malchow, Alexandre Bouhelier, Emanuel Lörtscher, Christophe Galland

Abstract: Electrically connected and plasmonically enhanced molecular junctions combine the optical functionalities of high field confinement and enhancement (cavity function), and of high radiative efficiency (antenna function) with the electrical functionalities of molecular transport. Such combined optical and electrical probes have proven useful for the fundamental understanding of metal-molecule contac… ▽ More Electrically connected and plasmonically enhanced molecular junctions combine the optical functionalities of high field confinement and enhancement (cavity function), and of high radiative efficiency (antenna function) with the electrical functionalities of molecular transport. Such combined optical and electrical probes have proven useful for the fundamental understanding of metal-molecule contacts and contribute to the development of nanoscale optoelectronic devices including ultrafast electronics and nanosensors. Here, we employ a self-assembled metal-molecule-metal junction with a nanoparticle bridge to investigate correlated fluctuations in conductance and tunneling-induced light emission at room temperature. Despite the presence of hundreds of molecules in the junction, the electrical conductance and light emission are both highly sensitive to atomic-scale fluctuations -- a phenomenology reminiscent of picocavities observed in Raman scattering and of luminescence blinking from photo-excited plasmonic junctions. Discrete steps in conductance associated with fluctuating emission intensities through the multiple plasmonic modes of the junction are consistent with a finite number of randomly localized, point-like sources dominating the optoelectronic response. Contrasting with these microscopic fluctuations, the overall plasmonic and electronic functionalities of our devices feature long-term survival at room temperature and under an electrical bias of a few volts, allowing for measurements over several months. △ Less

Submitted 28 March, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

Journal ref: ACS Photonics 2024

arXiv:2306.08306 [pdf, other]

doi 10.1145/3581783.3612463

Towards Balanced Active Learning for Multimodal Classification

Authors: Meng Shen, Yizheng Huang, Jianxiong Yin, Heqing Zou, Deepu Rajan, Simon See

Abstract: Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multi… ▽ More Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, a novel approach is proposed to achieve more fair data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification. △ Less

Submitted 21 August, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: 12 pages, accepted by ACMMM 2023

arXiv:2306.05888 [pdf, other]

TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

Authors: Xuesong Chen, Shaoshuai Shi, Chao Zhang, Ben** Zhu, Qiang Wang, Ka Chun Cheung, Simon See, Hongsheng Li

Abstract: 3D multi-object tracking (MOT) is vital for many applications including autonomous driving vehicles and service robots. With the commonly used tracking-by-detection paradigm, 3D MOT has made important progress in recent years. However, these methods only use the detection boxes of the current frame to obtain trajectory-box association results, which makes it impossible for the tracker to recover o… ▽ More 3D multi-object tracking (MOT) is vital for many applications including autonomous driving vehicles and service robots. With the commonly used tracking-by-detection paradigm, 3D MOT has made important progress in recent years. However, these methods only use the detection boxes of the current frame to obtain trajectory-box association results, which makes it impossible for the tracker to recover objects missed by the detector. In this paper, we present TrajectoryFormer, a novel point-cloud-based 3D MOT framework. To recover the missed object by detector, we generates multiple trajectory hypotheses with hybrid candidate boxes, including temporally predicted boxes and current-frame detection boxes, for trajectory-box association. The predicted boxes can propagate object's history trajectory information to the current frame and thus the network can tolerate short-term miss detection of the tracked objects. We combine long-term object motion feature and short-term object appearance feature to create per-hypothesis feature embedding, which reduces the computational overhead for spatial-temporal encoding. Additionally, we introduce a Global-Local Interaction Module to conduct information interaction among all hypotheses and models their spatial relations, leading to accurate estimation of hypotheses. Our TrajectoryFormer achieves state-of-the-art performance on the Waymo 3D MOT benchmarks. Code is available at https://github.com/poodarchu/EFG . △ Less

Submitted 18 August, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Accepted by ICCV 2023

arXiv:2306.02430 [pdf, other]

A Unified Framework for Factorizing Distributional Value Functions for Multi-Agent Reinforcement Learning

Authors: Wei-Fang Sun, Cheng-Kuang Lee, Simon See, Chun-Yi Lee

Abstract: In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of other agents. To address the above issues, we proposed a unified framework, called DFAC, for integrating distributional RL with value function factorization methods. This framework generalizes expected v… ▽ More In fully cooperative multi-agent reinforcement learning (MARL) settings, environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of other agents. To address the above issues, we proposed a unified framework, called DFAC, for integrating distributional RL with value function factorization methods. This framework generalizes expected value function factorization methods to enable the factorization of return distributions. To validate DFAC, we first demonstrate its ability to factorize the value functions of a simple matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps of the StarCraft Multi-Agent Challenge and six self-designed Ultra Hard maps, showing that DFAC is able to outperform a number of baselines. △ Less

Submitted 4 June, 2023; originally announced June 2023.

Comments: JMLR 2023. Extended version of arXiv:2102.07936

arXiv:2305.05191 [pdf, other]

COLA: Contextualized Commonsense Causal Reasoning from the Causal Inference Perspective

Authors: Zhaowei Wang, Quyet V. Do, Hongming Zhang, Jiayao Zhang, Weiqi Wang, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Detecting commonsense causal relations (causation) between events has long been an essential yet challenging task. Given that events are complicated, an event may have different causes under various contexts. Thus, exploiting context plays an essential role in detecting causal relations. Meanwhile, previous works about commonsense causation only consider two events and ignore their context, simpli… ▽ More Detecting commonsense causal relations (causation) between events has long been an essential yet challenging task. Given that events are complicated, an event may have different causes under various contexts. Thus, exploiting context plays an essential role in detecting causal relations. Meanwhile, previous works about commonsense causation only consider two events and ignore their context, simplifying the task formulation. This paper proposes a new task to detect commonsense causation between two events in an event sequence (i.e., context), called contextualized commonsense causal reasoning. We also design a zero-shot framework: COLA (Contextualized Commonsense Causality Reasoner) to solve the task from the causal inference perspective. This framework obtains rich incidental supervision from temporality and balances covariates from multiple timestamps to remove confounding effects. Our extensive experiments show that COLA can detect commonsense causality more accurately than baselines. △ Less

Submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted to the main conference of ACL 2023

arXiv:2305.04034 [pdf, other]

Wasserstein-Fisher-Rao Embedding: Logical Query Embeddings with Local Comparison and Global Transport

Authors: Zihao Wang, Weizhi Fei, Hang Yin, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Answering complex queries on knowledge graphs is important but particularly challenging because of the data incompleteness. Query embedding methods address this issue by learning-based models and simulating logical reasoning with set operators. Previous works focus on specific forms of embeddings, but scoring functions between embeddings are underexplored. In contrast to existing scoring functions… ▽ More Answering complex queries on knowledge graphs is important but particularly challenging because of the data incompleteness. Query embedding methods address this issue by learning-based models and simulating logical reasoning with set operators. Previous works focus on specific forms of embeddings, but scoring functions between embeddings are underexplored. In contrast to existing scoring functions motivated by local comparison or global transport, this work investigates the local and global trade-off with unbalanced optimal transport theory. Specifically, we embed sets as bounded measures in $\real$ endowed with a scoring function motivated by the Wasserstein-Fisher-Rao metric. Such a design also facilitates closed-form set operators in the embedding space. Moreover, we introduce a convolution-based algorithm for linear time computation and a block-diagonal kernel to enforce the trade-off. Results show that WFRE can outperform existing query embedding methods on standard datasets, evaluation sets with combinatorially complex queries, and hierarchical knowledge graphs. Ablation study shows that finding a better local and global trade-off is essential for performance improvement. △ Less

Submitted 6 May, 2023; originally announced May 2023.

Comments: Findings in ACL 2023. 16 pages, 6 figures, and 8 tables. Our implementation can be found at https://github.com/HKUST-KnowComp/WFRE

arXiv:2305.03973 [pdf, other]

DiscoPrompt: Path Prediction Prompt Tuning for Implicit Discourse Relation Recognition

Authors: Chunkit Chan, Xin Liu, Jiayang Cheng, Zihan Li, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Implicit Discourse Relation Recognition (IDRR) is a sophisticated and challenging task to recognize the discourse relations between the arguments with the absence of discourse connectives. The sense labels for each discourse relation follow a hierarchical classification scheme in the annotation process (Prasad et al., 2008), forming a hierarchy structure. Most existing works do not well incorporat… ▽ More Implicit Discourse Relation Recognition (IDRR) is a sophisticated and challenging task to recognize the discourse relations between the arguments with the absence of discourse connectives. The sense labels for each discourse relation follow a hierarchical classification scheme in the annotation process (Prasad et al., 2008), forming a hierarchy structure. Most existing works do not well incorporate the hierarchy structure but focus on the syntax features and the prior knowledge of connectives in the manner of pure text classification. We argue that it is more effective to predict the paths inside the hierarchical tree (e.g., "Comparison -> Contrast -> however") rather than flat labels (e.g., Contrast) or connectives (e.g., however). We propose a prompt-based path prediction method to utilize the interactive information and intrinsic senses among the hierarchy in IDRR. This is the first work that injects such structure information into pre-trained language models via prompt tuning, and the performance of our solution shows significant and consistent improvement against competitive baselines. △ Less

Submitted 6 May, 2023; originally announced May 2023.

Comments: Accepted to Findings of ACL 2023

arXiv:2304.05015 [pdf, other]

Continual Semantic Segmentation with Automatic Memory Sample Selection

Authors: Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, Jun Liu

Abstract: Continual Semantic Segmentation (CSS) extends static semantic segmentation by incrementally introducing new classes for training. To alleviate the catastrophic forgetting issue in CSS, a memory buffer that stores a small number of samples from the previous classes is constructed for replay. However, existing methods select the memory samples either randomly or based on a single-factor-driven handc… ▽ More Continual Semantic Segmentation (CSS) extends static semantic segmentation by incrementally introducing new classes for training. To alleviate the catastrophic forgetting issue in CSS, a memory buffer that stores a small number of samples from the previous classes is constructed for replay. However, existing methods select the memory samples either randomly or based on a single-factor-driven handcrafted strategy, which has no guarantee to be optimal. In this work, we propose a novel memory sample selection mechanism that selects informative samples for effective replay in a fully automatic way by considering comprehensive factors including sample diversity and class performance. Our mechanism regards the selection operation as a decision-making process and learns an optimal selection policy that directly maximizes the validation performance on a reward set. To facilitate the selection decision, we design a novel state representation and a dual-stage action space. Our extensive experiments on Pascal-VOC 2012 and ADE 20K datasets demonstrate the effectiveness of our approach with state-of-the-art (SOTA) performance achieved, outperforming the second-place one by 12.54% for the 6stage setting on Pascal-VOC 2012. △ Less

Submitted 11 April, 2023; originally announced April 2023.

Comments: Accepted to CVPR2023

arXiv:2303.08340 [pdf, other]

VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation

Authors: Xiaoyu Shi, Zhaoyang Huang, Weikang Bian, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li

Abstract: We introduce VideoFlow, a novel optical flow estimation framework for videos. In contrast to previous methods that learn to estimate optical flow from two frames, VideoFlow concurrently estimates bi-directional optical flows for multiple frames that are available in videos by sufficiently exploiting temporal cues. We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directiona… ▽ More We introduce VideoFlow, a novel optical flow estimation framework for videos. In contrast to previous methods that learn to estimate optical flow from two frames, VideoFlow concurrently estimates bi-directional optical flows for multiple frames that are available in videos by sufficiently exploiting temporal cues. We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directional optical flows for the center frame in a three-frame manner. The information of the frame triplet is iteratively fused onto the center frame. To extend TROF for handling more frames, we further propose a MOtion Propagation (MOP) module that bridges multiple TROFs and propagates motion features between adjacent TROFs. With the iterative flow estimation refinement, the information fused in individual TROFs can be propagated into the whole sequence via MOP. By effectively exploiting video information, VideoFlow presents extraordinary performance, ranking 1st on all public benchmarks. On the Sintel benchmark, VideoFlow achieves 1.649 and 0.991 average end-point-error (AEPE) on the final and clean passes, a 15.1% and 7.6% error reduction from the best-published results (1.943 and 1.073 from FlowFormer++). On the KITTI-2015 benchmark, VideoFlow achieves an F1-all error of 3.65%, a 19.2% error reduction from the best-published result (4.52% from FlowFormer++). Code is released at \url{https://github.com/XiaoyuShi97/VideoFlow}. △ Less

Submitted 20 August, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

arXiv:2303.01237 [pdf, other]

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation

Authors: Xiaoyu Shi, Zhaoyang Huang, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li

Abstract: FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance. The core component of FlowFormer is the transformer-based cost-volume encoder. Inspired by the recent success of masked autoencoding (MAE) pretraining in unleashing transformers' capacity of encoding visual representation, we propose Masked Cost Volume Autoencoding (MCVA) to enh… ▽ More FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance. The core component of FlowFormer is the transformer-based cost-volume encoder. Inspired by the recent success of masked autoencoding (MAE) pretraining in unleashing transformers' capacity of encoding visual representation, we propose Masked Cost Volume Autoencoding (MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a novel MAE scheme. Firstly, we introduce a block-sharing masking strategy to prevent masked information leakage, as the cost maps of neighboring source pixels are highly correlated. Secondly, we propose a novel pre-text reconstruction task, which encourages the cost-volume encoder to aggregate long-range information and ensures pretraining-finetuning consistency. We also show how to modify the FlowFormer architecture to accommodate masks during pretraining. Pretrained with MCVA, FlowFormer++ ranks 1st among published methods on both Sintel and KITTI-2015 benchmarks. Specifically, FlowFormer++ achieves 1.07 and 1.94 average end-point error (AEPE) on the clean and final pass of Sintel benchmark, leading to 7.76\% and 7.18\% error reductions from FlowFormer. FlowFormer++ obtains 4.52 F1-all on the KITTI-2015 test set, improving FlowFormer by 0.16. △ Less

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2301.08859 [pdf, other]

Logical Message Passing Networks with One-hop Inference on Atomic Formulas

Authors: Zihao Wang, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer the logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embedding… ▽ More Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer the logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from the zero, where whether and how the embeddings or the neural set operators contribute to the performance remains not clear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent the complex queries into the query graph. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the local one-hop inferences on atomic formulas to the global logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to finally predict the answers' embeddings. The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph represenation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields the new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning. △ Less

Submitted 26 August, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

Comments: Accepted by ICLR 2023. 20 pages, 4 figures, and 9 tables. Our implementation can be found at https://github.com/HKUST-KnowComp/LMPNN . update v4: more accurate comparison about the computational cost between LMPNN and GNN-QE. update v3: typo fix. update v2: add code repository

arXiv:2301.00407 [pdf, other]

MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs

Authors: Huaizheng Zhang, Yuanming Li, Wencong Xiao, Yizheng Huang, Xing Di, Jianxiong Yin, Simon See, Yong Luo, Chiew Tong Lau, Yang You

Abstract: New architecture GPUs like A100 are now equipped with multi-instance GPU (MIG) technology, which allows the GPU to be partitioned into multiple small, isolated instances. This technology provides more flexibility for users to support both deep learning training and inference workloads, but efficiently utilizing it can still be challenging. The vision of this paper is to provide a more comprehensiv… ▽ More New architecture GPUs like A100 are now equipped with multi-instance GPU (MIG) technology, which allows the GPU to be partitioned into multiple small, isolated instances. This technology provides more flexibility for users to support both deep learning training and inference workloads, but efficiently utilizing it can still be challenging. The vision of this paper is to provide a more comprehensive and practical benchmark study for MIG in order to eliminate the need for tedious manual benchmarking and tuning efforts. To achieve this vision, the paper presents MIGPerf, an open-source tool that streamlines the benchmark study for MIG. Using MIGPerf, the authors conduct a series of experiments, including deep learning training and inference characterization on MIG, GPU sharing characterization, and framework compatibility with MIG. The results of these experiments provide new insights and guidance for users to effectively employ MIG, and lay the foundation for further research on the orchestration of hybrid training and inference workloads on MIGs. The code and results are released on https://github.com/MLSysOps/MIGProfiler. This work is still in progress and more results will be published soon. △ Less

Submitted 1 January, 2023; originally announced January 2023.

Comments: 10 pages, 11 figures

arXiv:2212.08830 [pdf, other]

Inductive Attention for Video Action Anticipation

Authors: Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Simon See, Oswald Lanz

Abstract: Anticipating future actions based on spatiotemporal observations is essential in video understanding and predictive computer vision. Moreover, a model capable of anticipating the future has important applications, it can benefit precautionary systems to react before an event occurs. However, unlike in the action recognition task, future information is inaccessible at observation time -- a model ca… ▽ More Anticipating future actions based on spatiotemporal observations is essential in video understanding and predictive computer vision. Moreover, a model capable of anticipating the future has important applications, it can benefit precautionary systems to react before an event occurs. However, unlike in the action recognition task, future information is inaccessible at observation time -- a model cannot directly map the video frames to the target action to solve the anticipation task. Instead, the temporal inference is required to associate the relevant evidence with possible future actions. Consequently, existing solutions based on the action recognition models are only suboptimal. Recently, researchers proposed extending the observation window to capture longer pre-action profiles from past moments and leveraging attention to retrieve the subtle evidence to improve the anticipation predictions. However, existing attention designs typically use frame inputs as the query which is suboptimal, as a video frame only weakly connects to the future action. To this end, we propose an inductive attention model, dubbed IAM, which leverages the current prediction priors as the query to infer future action and can efficiently process the long video content. Furthermore, our method considers the uncertainty of the future via the many-to-many association in the attention design. As a result, IAM consistently outperforms the state-of-the-art anticipation models on multiple large-scale egocentric video datasets while using significantly fewer model parameters. △ Less

Submitted 18 March, 2023; v1 submitted 17 December, 2022; originally announced December 2022.

arXiv:2211.12759 [pdf, other]

NAS-LID: Efficient Neural Architecture Search with Local Intrinsic Dimension

Authors: Xin He, Jiangchao Yao, Yuxin Wang, Zhenheng Tang, Ka Chu Cheung, Simon See, Bo Han, Xiaowen Chu

Abstract: One-shot neural architecture search (NAS) substantially improves the search efficiency by training one supernet to estimate the performance of every possible child architecture (i.e., subnet). However, the inconsistency of characteristics among subnets incurs serious interference in the optimization, resulting in poor performance ranking correlation of subnets. Subsequent explorations decompose su… ▽ More One-shot neural architecture search (NAS) substantially improves the search efficiency by training one supernet to estimate the performance of every possible child architecture (i.e., subnet). However, the inconsistency of characteristics among subnets incurs serious interference in the optimization, resulting in poor performance ranking correlation of subnets. Subsequent explorations decompose supernet weights via a particular criterion, e.g., gradient matching, to reduce the interference; yet they suffer from huge computational cost and low space separability. In this work, we propose a lightweight and effective local intrinsic dimension (LID)-based method NAS-LID. NAS-LID evaluates the geometrical properties of architectures by calculating the low-cost LID features layer-by-layer, and the similarity characterized by LID enjoys better separability compared with gradients, which thus effectively reduces the interference among subnets. Extensive experiments on NASBench-201 indicate that NAS-LID achieves superior performance with better efficiency. Specifically, compared to the gradient-driven method, NAS-LID can save up to 86% of GPU memory overhead when searching on NASBench-201. We also demonstrate the effectiveness of NAS-LID on ProxylessNAS and OFA spaces. Source code: https://github.com/marsggbo/NAS-LID. △ Less

Submitted 24 November, 2022; v1 submitted 23 November, 2022; originally announced November 2022.

Comments: Accepted by AAAI2023, AutoML, NAS

arXiv:2211.03635 [pdf, other]

Complex Hyperbolic Knowledge Graph Embeddings with Fast Fourier Transform

Authors: Huiru Xiao, Xin Liu, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: The choice of geometric space for knowledge graph (KG) embeddings can have significant effects on the performance of KG completion tasks. The hyperbolic geometry has been shown to capture the hierarchical patterns due to its tree-like metrics, which addressed the limitations of the Euclidean embedding models. Recent explorations of the complex hyperbolic geometry further improved the hyperbolic em… ▽ More The choice of geometric space for knowledge graph (KG) embeddings can have significant effects on the performance of KG completion tasks. The hyperbolic geometry has been shown to capture the hierarchical patterns due to its tree-like metrics, which addressed the limitations of the Euclidean embedding models. Recent explorations of the complex hyperbolic geometry further improved the hyperbolic embeddings for capturing a variety of hierarchical structures. However, the performance of the hyperbolic KG embedding models for non-transitive relations is still unpromising, while the complex hyperbolic embeddings do not deal with multi-relations. This paper aims to utilize the representation capacity of the complex hyperbolic geometry in multi-relational KG embeddings. To apply the geometric transformations which account for different relations and the attention mechanism in the complex hyperbolic space, we propose to use the fast Fourier transform (FFT) as the conversion between the real and complex hyperbolic space. Constructing the attention-based transformations in the complex space is very challenging, while the proposed Fourier transform-based complex hyperbolic approaches provide a simple and effective solution. Experimental results show that our methods outperform the baselines, including the Euclidean and the real hyperbolic embedding models. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: Aceepted by the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP22)

arXiv:2211.01412 [pdf, other]

doi 10.1109/JBHI.2024.3354712

CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation

Authors: Jun Wang, Abhir Bhalerao, Terry Yin, Simon See, Yulan He

Abstract: Radiology report generation (RRG) has gained increasing research attention because of its huge potential to mitigate medical resource shortages and aid the process of disease decision making by radiologists. Recent advancements in RRG are largely driven by improving a model's capabilities in encoding single-modal feature representations, while few studies explicitly explore the cross-modal alignme… ▽ More Radiology report generation (RRG) has gained increasing research attention because of its huge potential to mitigate medical resource shortages and aid the process of disease decision making by radiologists. Recent advancements in RRG are largely driven by improving a model's capabilities in encoding single-modal feature representations, while few studies explicitly explore the cross-modal alignment between image regions and words. Radiologists typically focus first on abnormal image regions before composing the corresponding text descriptions, thus cross-modal alignment is of great importance to learn a RRG model which is aware of abnormalities in the image. Motivated by this, we propose a Class Activation Map guided Attention Network (CAMANet) which explicitly promotes crossmodal alignment by employing aggregated class activation maps to supervise cross-modal attention learning, and simultaneously enrich the discriminative information. CAMANet contains three complementary modules: a Visual Discriminative Map Generation module to generate the importance/contribution of each visual token; Visual Discriminative Map Assisted Encoder to learn the discriminative representation and enrich the discriminative information; and a Visual Textual Attention Consistency module to ensure the attention consistency between the visual and textual tokens, to achieve the cross-modal alignment. Experimental results demonstrate that CAMANet outperforms previous SOTA methods on two commonly used RRG benchmarks. △ Less

Submitted 3 March, 2024; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted to IEEE Journal of Biomedical and Health Informatics (IJBHI). 13 pages, 8 figures

arXiv:2210.07988 [pdf, other]

PseudoReasoner: Leveraging Pseudo Labels for Commonsense Knowledge Base Population

Authors: Tianqing Fang, Quyet V. Do, Hongming Zhang, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: Commonsense Knowledge Base (CSKB) Population aims at reasoning over unseen entities and assertions on CSKBs, and is an important yet hard commonsense reasoning task. One challenge is that it requires out-of-domain generalization ability as the source CSKB for training is of a relatively smaller scale (1M) while the whole candidate space for population is way larger (200M). We propose PseudoReasone… ▽ More Commonsense Knowledge Base (CSKB) Population aims at reasoning over unseen entities and assertions on CSKBs, and is an important yet hard commonsense reasoning task. One challenge is that it requires out-of-domain generalization ability as the source CSKB for training is of a relatively smaller scale (1M) while the whole candidate space for population is way larger (200M). We propose PseudoReasoner, a semi-supervised learning framework for CSKB population that uses a teacher model pre-trained on CSKBs to provide pseudo labels on the unlabeled candidate dataset for a student model to learn from. The teacher can be a generative model rather than restricted to discriminative models as previous works. In addition, we design a new filtering procedure for pseudo labels based on influence function and the student model's prediction to further improve the performance. The framework can improve the backbone model KG-BERT (RoBERTa-large) by 3.3 points on the overall performance and especially, 5.3 points on the out-of-domain performance, and achieves the state-of-the-art. Codes and data are available at https://github.com/HKUST-KnowComp/PseudoReasoner. △ Less

Submitted 14 October, 2022; originally announced October 2022.

Comments: Findings of EMNLP 2022

arXiv:2210.06694 [pdf, other]

SubeventWriter: Iterative Sub-event Sequence Generation with Coherence Controller

Authors: Zhaowei Wang, Hongming Zhang, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

Abstract: In this paper, we propose a new task of sub-event generation for an unseen process to evaluate the understanding of the coherence of sub-event actions and objects. To solve the problem, we design SubeventWriter, a sub-event sequence generation framework with a coherence controller. Given an unseen process, the framework can iteratively construct the sub-event sequence by generating one sub-event a… ▽ More In this paper, we propose a new task of sub-event generation for an unseen process to evaluate the understanding of the coherence of sub-event actions and objects. To solve the problem, we design SubeventWriter, a sub-event sequence generation framework with a coherence controller. Given an unseen process, the framework can iteratively construct the sub-event sequence by generating one sub-event at each iteration. We also design a very effective coherence controller to decode more coherent sub-events. As our extensive experiments and analysis indicate, SubeventWriter can generate more reliable and meaningful sub-event sequences for unseen processes. △ Less

Submitted 19 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to the main conference of EMNLP 2022

arXiv:2210.00474 [pdf, other]

Saving the Lim**: Fault-tolerant Quadruped Locomotion via Reinforcement Learning

Authors: Dikai Liu, Tianwei Zhang, Jianxiong Yin, Simon See

Abstract: Modern quadrupeds are skillful in traversing or even sprinting on uneven terrains in a remote uncontrolled environment. However, survival in the wild requires not only maneuverability, but also the ability to handle potential critical hardware failures. How to grant such ability to quadrupeds is rarely investigated. In this paper, we propose a novel methodology to train and test hardware fault-tol… ▽ More Modern quadrupeds are skillful in traversing or even sprinting on uneven terrains in a remote uncontrolled environment. However, survival in the wild requires not only maneuverability, but also the ability to handle potential critical hardware failures. How to grant such ability to quadrupeds is rarely investigated. In this paper, we propose a novel methodology to train and test hardware fault-tolerant controllers for quadruped locomotion, both in the simulation and physical world. We adopt the teacher-student reinforcement learning framework to train the controller with close-to-reality joint-locking failure in the simulation, which can be zero-shot transferred to the physical robot without any fine-tuning. Extensive experiments show that our fault-tolerant controller can efficiently lead a quadruped stably when it faces joint failures during locomotion. △ Less

Submitted 7 September, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

Comments: This work has been submitted to IEEE RA-L for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Project website for video: https://johnliudk.github.io/saving-the-lim**/

arXiv:2207.01208 [pdf, other]

Attributed Abnormality Graph Embedding for Clinically Accurate X-Ray Report Generation

Authors: Sixing Yan, William K. Cheung, Keith Chiu, Terence M. Tong, Charles K. Cheung, Simon See

Abstract: Automatic generation of medical reports from X-ray images can assist radiologists to perform the time-consuming and yet important reporting task. Yet, achieving clinically accurate generated reports remains challenging. Modeling the underlying abnormalities using the knowledge graph approach has been found promising in enhancing the clinical accuracy. In this paper, we introduce a novel fined-grai… ▽ More Automatic generation of medical reports from X-ray images can assist radiologists to perform the time-consuming and yet important reporting task. Yet, achieving clinically accurate generated reports remains challenging. Modeling the underlying abnormalities using the knowledge graph approach has been found promising in enhancing the clinical accuracy. In this paper, we introduce a novel fined-grained knowledge graph structure called an attributed abnormality graph (ATAG). The ATAG consists of interconnected abnormality nodes and attribute nodes, allowing it to better capture the abnormality details. In contrast to the existing methods where the abnormality graph was constructed manually, we propose a methodology to automatically construct the fine-grained graph structure based on annotations, medical reports in X-ray datasets, and the RadLex radiology lexicon. We then learn the ATAG embedding using a deep model with an encoder-decoder architecture for the report generation. In particular, graph attention networks are explored to encode the relationships among the abnormalities and their attributes. A gating mechanism is adopted and integrated with various decoders for the generation. We carry out extensive experiments based on the benchmark datasets, and show that the proposed ATAG-based deep model outperforms the SOTA methods by a large margin and can improve the clinical accuracy of the generated reports. △ Less

Submitted 5 July, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: 14 pages, 7 figures

arXiv:2206.10869 [pdf, other]

NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022

Authors: Tsung-Ming Tai, Oswald Lanz, Giuseppe Fiameni, Yi-Kwan Wong, Sze-Sen Poon, Cheng-Kuang Lee, Ka-Chun Cheung, Simon See

Abstract: In this report, we describe the technical details of our submission for the EPIC-Kitchen-100 action anticipation challenge. Our modelings, the higher-order recurrent space-time transformer and the message-passing neural network with edge learning, are both recurrent-based architectures which observe only 2.5 seconds inference context to form the action anticipation prediction. By averaging the pre… ▽ More In this report, we describe the technical details of our submission for the EPIC-Kitchen-100 action anticipation challenge. Our modelings, the higher-order recurrent space-time transformer and the message-passing neural network with edge learning, are both recurrent-based architectures which observe only 2.5 seconds inference context to form the action anticipation prediction. By averaging the prediction scores from a set of models compiled with our proposed training pipeline, we achieved strong performance on the test set, which is 19.61% overall mean top-5 recall, recorded as second place on the public leaderboard. △ Less

Submitted 22 June, 2022; originally announced June 2022.

arXiv:2206.10810 [pdf, other]

A Simple Baseline for Video Restoration with Grouped Spatial-temporal Shift

Authors: Dasong Li, Xiaoyu Shi, Yi Zhang, Ka Chun Cheung, Simon See, Xiaogang Wang, Hongwei Qin, Hongsheng Li

Abstract: Video restoration, which aims to restore clear frames from degraded videos, has numerous important applications. The key to video restoration depends on utilizing inter-frame information. However, existing deep learning methods often rely on complicated network architectures, such as optical flow estimation, deformable convolution, and cross-frame self-attention layers, resulting in high computati… ▽ More Video restoration, which aims to restore clear frames from degraded videos, has numerous important applications. The key to video restoration depends on utilizing inter-frame information. However, existing deep learning methods often rely on complicated network architectures, such as optical flow estimation, deformable convolution, and cross-frame self-attention layers, resulting in high computational costs. In this study, we propose a simple yet effective framework for video restoration. Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique that can implicitly capture inter-frame correspondences for multi-frame aggregation. By introducing grouped spatial shift, we attain expansive effective receptive fields. Combined with basic 2D convolution, this simple framework can effectively aggregate inter-frame information. Extensive experiments demonstrate that our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost, on both video deblurring and video denoising tasks. These results indicate the potential for our approach to significantly reduce computational overhead while maintaining high-quality results. Code is avaliable at https://github.com/dasongli1/Shift-Net. △ Less

Submitted 22 May, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: Accepted to CVPR2023

Journal ref: 2023 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

arXiv:2206.01009 [pdf, other]

Unified Recurrence Modeling for Video Action Anticipation

Authors: Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Simon See, Oswald Lanz

Abstract: Forecasting future events based on evidence of current conditions is an innate skill of human beings, and key for predicting the outcome of any decision making. In artificial vision for example, we would like to predict the next human action before it happens, without observing the future video frames associated to it. Computer vision models for action anticipation are expected to collect the subt… ▽ More Forecasting future events based on evidence of current conditions is an innate skill of human beings, and key for predicting the outcome of any decision making. In artificial vision for example, we would like to predict the next human action before it happens, without observing the future video frames associated to it. Computer vision models for action anticipation are expected to collect the subtle evidence in the preamble of the target actions. In prior studies recurrence modeling often leads to better performance, the strong temporal inference is assumed to be a key element for reasonable prediction. To this end, we propose a unified recurrence modeling for video action anticipation via message passing framework. The information flow in space-time can be described by the interaction between vertices and edges, and the changes of vertices for each incoming frame reflects the underlying dynamics. Our model leverages self-attention as the building blocks for each of the message passing functions. In addition, we introduce different edge learning strategies that can be end-to-end optimized to gain better flexibility for the connectivity between vertices. Our experimental results demonstrate that our proposed method outperforms previous works on the large-scale EPIC-Kitchen dataset. △ Less

Submitted 2 June, 2022; originally announced June 2022.

arXiv:2110.09930 [pdf, other]

Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning

Authors: Yi-Chen Chen, Shu-wen Yang, Cheng-Kuang Lee, Simon See, Hung-yi Lee

Abstract: Speech representation learning plays a vital role in speech processing. Among them, self-supervised learning (SSL) has become an important research direction. It has been shown that an SSL pretraining model can achieve excellent performance in various downstream tasks of speech processing. On the other hand, supervised multi-task learning (MTL) is another representation learning paradigm, which ha… ▽ More Speech representation learning plays a vital role in speech processing. Among them, self-supervised learning (SSL) has become an important research direction. It has been shown that an SSL pretraining model can achieve excellent performance in various downstream tasks of speech processing. On the other hand, supervised multi-task learning (MTL) is another representation learning paradigm, which has been proven effective in computer vision (CV) and natural language processing (NLP). However, there is no systematic research on the general representation learning model trained by supervised MTL in speech processing. In this paper, we show that MTL finetuning can further improve SSL pretraining. We analyze the generalizability of supervised MTL finetuning to examine if the speech representation learned by MTL finetuning can generalize to unseen new tasks. △ Less

Submitted 18 October, 2021; originally announced October 2021.

arXiv:2109.12493 [pdf, other]

Self-Supervised Video Representation Learning by Video Incoherence Detection

Authors: Haozhi Cao, Yuecong Xu, Jianfei Yang, Kezhi Mao, Lihua Xie, Jianxiong Yin, Simon See

Abstract: This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It roots from the observation that visual systems of human beings can easily identify video incoherence based on their comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed by multiple sub-clips hierarchicall… ▽ More This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It roots from the observation that visual systems of human beings can easily identify video incoherence based on their comprehensive understanding of videos. Specifically, the training sample, denoted as the incoherent clip, is constructed by multiple sub-clips hierarchically sampled from the same raw video with various lengths of incoherence between each other. The network is trained to learn high-level representation by predicting the location and length of incoherence given the incoherent clip as input. Additionally, intra-video contrastive learning is introduced to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval utilizing various backbone networks. Experiments show that our proposed method achieves state-of-the-art performance across different backbone networks and different datasets compared with previous coherence-based methods. △ Less

Submitted 26 September, 2021; originally announced September 2021.

Comments: 11 pages, 7 figures

arXiv:2107.04932 [pdf, other]

doi 10.1109/TNNLS.2022.3212909

Aligning Correlation Information for Domain Adaptation in Action Recognition

Authors: Yuecong Xu, Jianfei Yang, Haozhi Cao, Kezhi Mao, Jianxiong Yin, Simon See

Abstract: Domain adaptation (DA) approaches address domain shift and enable networks to be applied to different scenarios. Although various image DA approaches have been proposed in recent years, there is limited research towards video DA. This is partly due to the complexity in adapting the different modalities of features in videos, which includes the correlation features extracted as long-term dependenci… ▽ More Domain adaptation (DA) approaches address domain shift and enable networks to be applied to different scenarios. Although various image DA approaches have been proposed in recent years, there is limited research towards video DA. This is partly due to the complexity in adapting the different modalities of features in videos, which includes the correlation features extracted as long-term dependencies of pixels across spatiotemporal dimensions. The correlation features are highly associated with action classes and proven their effectiveness in accurate video feature extraction through the supervised action recognition task. Yet correlation features of the same action would differ across domains due to domain shift. Therefore we propose a novel Adversarial Correlation Adaptation Network (ACAN) to align action videos by aligning pixel correlations. ACAN aims to minimize the distribution of correlation information, termed as Pixel Correlation Discrepancy (PCD). Additionally, video DA research is also limited by the lack of cross-domain video datasets with larger domain shifts. We, therefore, introduce a novel HMDB-ARID dataset with a larger domain shift caused by a larger statistical difference between domains. This dataset is built in an effort to leverage current datasets for dark video classification. Empirical results demonstrate the state-of-the-art performance of our proposed ACAN for both existing and the new video DA datasets. △ Less

Submitted 8 December, 2022; v1 submitted 10 July, 2021; originally announced July 2021.

Comments: The dataset HMDB-ARID is available at https://xuyu0010.github.io/vuda.html.Camera-ready version of this paper accepted at IEEE TNNLS. Correction made for Figure 1 of the Camera-ready version

arXiv:2106.03167 [pdf, other]

Mathematical Vocoder Algorithm : Modified Spectral Inversion for Efficient Neural Speech Synthesis

Authors: Hyun Gon Ryu, Jeong-Hoon Kim, Simon See

Abstract: In this work, we propose a new mathematical vocoder algorithm(modified spectral inversion) that generates a waveform from acoustic features without phase estimation. The main benefit of using our proposed method is that it excludes the training stage of the neural vocoder from the end-to-end speech synthesis model. Our implementation can synthesize high fidelity speech at approximately 20 Mhz on C… ▽ More In this work, we propose a new mathematical vocoder algorithm(modified spectral inversion) that generates a waveform from acoustic features without phase estimation. The main benefit of using our proposed method is that it excludes the training stage of the neural vocoder from the end-to-end speech synthesis model. Our implementation can synthesize high fidelity speech at approximately 20 Mhz on CPU and 59.6MHz on GPU. This is 909 and 2,702 times faster compared to real-time. Since the proposed methodology is not a data-driven method, it is applicable to unseen voices and multiple languages without any additional work. The proposed method is expected to adapt for researching on neural network models capable of synthesizing speech at the studio recording level. △ Less

Submitted 15 June, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: see sample in https://its10041004.github.io/MVA/

arXiv:2008.11378 [pdf, other]

Effective Action Recognition with Embedded Key Point Shifts

Authors: Haozhi Cao, Yuecong Xu, Jianfei Yang, Kezhi Mao, Jianxiong Yin, Simon See

Abstract: Temporal feature extraction is an essential technique in video-based action recognition. Key points have been utilized in skeleton-based action recognition methods but they require costly key point annotation. In this paper, we propose a novel temporal feature extraction module, named Key Point Shifts Embedding Module ($KPSEM$), to adaptively extract channel-wise key point shifts across video fram… ▽ More Temporal feature extraction is an essential technique in video-based action recognition. Key points have been utilized in skeleton-based action recognition methods but they require costly key point annotation. In this paper, we propose a novel temporal feature extraction module, named Key Point Shifts Embedding Module ($KPSEM$), to adaptively extract channel-wise key point shifts across video frames without key point annotation for temporal feature extraction. Key points are adaptively extracted as feature points with maximum feature values at split regions, while key point shifts are the spatial displacements of corresponding key points. The key point shifts are encoded as the overall temporal features via linear embedding layers in a multi-set manner. Our method achieves competitive performance through embedding key point shifts with trivial computational cost, achieving the state-of-the-art performance of 82.05% on Mini-Kinetics and competitive performance on UCF101, Something-Something-v1, and HMDB51 datasets. △ Less

Submitted 26 August, 2020; originally announced August 2020.

Comments: 35 pages, 10 figures

arXiv:2008.01170 [pdf, other]

Deep Learning Models for Early Detection and Prediction of the spread of Novel Coronavirus (COVID-19)

Authors: Devante Ayris, Kye Horbury, Blake Williams, Mitchell Blackney, Celine Shi Hui See, Maleeha Imtiaz, Syed Afaq Ali Shah

Abstract: SARS-CoV2, which causes coronavirus disease (COVID-19) is continuing to spread globally and has become a pandemic. People have lost their lives due to the virus and the lack of counter measures in place. Given the increasing caseload and uncertainty of spread, there is an urgent need to develop machine learning techniques to predict the spread of COVID-19. Prediction of the spread can allow counte… ▽ More SARS-CoV2, which causes coronavirus disease (COVID-19) is continuing to spread globally and has become a pandemic. People have lost their lives due to the virus and the lack of counter measures in place. Given the increasing caseload and uncertainty of spread, there is an urgent need to develop machine learning techniques to predict the spread of COVID-19. Prediction of the spread can allow counter measures and actions to be implemented to mitigate the spread of COVID-19. In this paper, we propose a deep learning technique, called Deep Sequential Prediction Model (DSPM) and machine learning based Non-parametric Regression Model (NRM) to predict the spread of COVID-19. Our proposed models were trained and tested on novel coronavirus 2019 dataset, which contains 19.53 Million confirmed cases of COVID-19. Our proposed models were evaluated by using Mean Absolute Error and compared with baseline method. Our experimental results, both quantitative and qualitative, demonstrate the superior prediction performance of the proposed models. △ Less

Submitted 15 February, 2021; v1 submitted 29 July, 2020; originally announced August 2020.

arXiv:2007.03432 [pdf, other]

A deep learning based nonlinear upscaling method for transport equations

Authors: Tak Shing Au Yeung, Eric T. Chung, Simon See

Abstract: We will develop a nonlinear upscaling method for nonlinear transport equation. The proposed scheme gives a coarse scale equation for the cell average of the solution. In order to compute the parameters in the coarse scale equation, a local downscaling operator is constructed. This downscaling operation recovers fine scale properties using cell averages. This is achieved by solving the equation on… ▽ More We will develop a nonlinear upscaling method for nonlinear transport equation. The proposed scheme gives a coarse scale equation for the cell average of the solution. In order to compute the parameters in the coarse scale equation, a local downscaling operator is constructed. This downscaling operation recovers fine scale properties using cell averages. This is achieved by solving the equation on an oversampling region with the given cell average as constraint. Due to the nonlinearity, one needs to compute these downscaling operations on the fly and cannot pre-compute these quantities. In order to give an efficient downscaling operation, we apply a deep learning approach. We will use a deep neural network to approximate the downscaling operation. Our numerical results show that the proposed scheme can achieve a good accuracy and efficiency. △ Less

Submitted 7 July, 2020; originally announced July 2020.

Comments: 18 pages

arXiv:2006.05091 [pdf, other]

PNL: Efficient Long-Range Dependencies Extraction with Pyramid Non-Local Module for Action Recognition

Authors: Yuecong Xu, Haozhi Cao, Jianfei Yang, Kezhi Mao, Jianxiong Yin, Simon See

Abstract: Long-range spatiotemporal dependencies capturing plays an essential role in improving video features for action recognition. The non-local block inspired by the non-local means is designed to address this challenge and have shown excellent performance. However, the non-local block brings significant increase in computation cost to the original network. It also lacks the ability to model regional c… ▽ More Long-range spatiotemporal dependencies capturing plays an essential role in improving video features for action recognition. The non-local block inspired by the non-local means is designed to address this challenge and have shown excellent performance. However, the non-local block brings significant increase in computation cost to the original network. It also lacks the ability to model regional correlation in videos. To address the above limitations, we propose Pyramid Non-Local (PNL) module, which extends the non-local block by incorporating regional correlation at multiple scales through a pyramid structured module. This extension upscales the effectiveness of non-local operation by attending to the interaction between different regions. Empirical results prove the effectiveness and efficiency of our PNL module, which achieves state-of-the-art performance of 83.09% on the Mini-Kinetics dataset, with decreased computation cost compared to the non-local block. △ Less

Submitted 9 June, 2020; originally announced June 2020.

Comments: Single column, 26 pages, 6 figures

arXiv:2006.03876 [pdf, other]

ARID: A New Dataset for Recognizing Action in the Dark

Authors: Yuecong Xu, Jianfei Yang, Haozhi Cao, Kezhi Mao, Jianxiong Yin, Simon See

Abstract: The task of action recognition in dark videos is useful in various scenarios, e.g., night surveillance and self-driving at night. Though progress has been made in the action recognition task for videos in normal illumination, few have studied action recognition in the dark. This is partly due to the lack of sufficient datasets for such a task. In this paper, we explored the task of action recognit… ▽ More The task of action recognition in dark videos is useful in various scenarios, e.g., night surveillance and self-driving at night. Though progress has been made in the action recognition task for videos in normal illumination, few have studied action recognition in the dark. This is partly due to the lack of sufficient datasets for such a task. In this paper, we explored the task of action recognition in dark videos. We bridge the gap of the lack of data for this task by collecting a new dataset: the Action Recognition in the Dark (ARID) dataset. It consists of over 3,780 video clips with 11 action categories. To the best of our knowledge, it is the first dataset focused on human actions in dark videos. To gain further understandings of our ARID dataset, we analyze the ARID dataset in detail and exhibited its necessity over synthetic dark videos. Additionally, we benchmarked the performance of several current action recognition models on our dataset and explored potential methods for increasing their performances. Our results show that current action recognition models and frame enhancement methods may not be effective solutions for the task of action recognition in dark videos. △ Less

Submitted 19 August, 2022; v1 submitted 6 June, 2020; originally announced June 2020.

Comments: 6 pages, 7 figures, Data available at https://xuyu0010.github.io/arid, simplified title, extension of IJCAIW version published by Springer (https://link.springer.com/chapter/10.1007/978-981-16-0575-8_6)

arXiv:2005.02591 [pdf, other]

Exploiting Inter-Frame Regional Correlation for Efficient Action Recognition

Authors: Yuecong Xu, Jianfei Yang, Kezhi Mao, Jianxiong Yin, Simon See

Abstract: Temporal feature extraction is an important issue in video-based action recognition. Optical flow is a popular method to extract temporal feature, which produces excellent performance thanks to its capacity of capturing pixel-level correlation information between consecutive frames. However, such a pixel-level correlation is extracted at the cost of high computational complexity and large storage… ▽ More Temporal feature extraction is an important issue in video-based action recognition. Optical flow is a popular method to extract temporal feature, which produces excellent performance thanks to its capacity of capturing pixel-level correlation information between consecutive frames. However, such a pixel-level correlation is extracted at the cost of high computational complexity and large storage resource. In this paper, we propose a novel temporal feature extraction method, named Attentive Correlated Temporal Feature (ACTF), by exploring inter-frame correlation within a certain region. The proposed ACTF exploits both bilinear and linear correlation between successive frames on the regional level. Our method has the advantage of achieving performance comparable to or better than optical flow-based methods while avoiding the introduction of optical flow. Experimental results demonstrate our proposed method achieves the state-of-the-art performances of 96.3% on UCF101 and 76.3% on HMDB51 benchmark datasets. △ Less

Submitted 6 May, 2020; originally announced May 2020.

Comments: 24 pages (exclude reference), 7 figures, 4 tables

arXiv:2003.03785 [pdf, ps, other]

Dependently Typed Knowledge Graphs

Authors: Zhangsheng Lai, Aik Beng Ng, Liang Ze Wong, Simon See, Shaowei Lin

Abstract: Reasoning over knowledge graphs is traditionally built upon a hierarchy of languages in the Semantic Web Stack. Starting from the Resource Description Framework (RDF) for knowledge graphs, more advanced constructs have been introduced through various syntax extensions to add reasoning capabilities to knowledge graphs. In this paper, we show how standardized semantic web technologies (RDF and its q… ▽ More Reasoning over knowledge graphs is traditionally built upon a hierarchy of languages in the Semantic Web Stack. Starting from the Resource Description Framework (RDF) for knowledge graphs, more advanced constructs have been introduced through various syntax extensions to add reasoning capabilities to knowledge graphs. In this paper, we show how standardized semantic web technologies (RDF and its query language SPARQL) can be reproduced in a unified manner with dependent type theory. In addition to providing the basic functionalities of knowledge graphs, dependent types add expressiveness in encoding both entities and queries, explainability in answers to queries through witnesses, and compositionality and automation in the construction of witnesses. Using the Coq proof assistant, we demonstrate how to build and query dependently typed knowledge graphs as a proof of concept for future works in this direction. △ Less

Submitted 8 March, 2020; originally announced March 2020.

arXiv:1911.08772 [pdf, other]

Understanding Top-k Sparsification in Distributed Deep Learning

Authors: Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

Abstract: Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious i… ▽ More Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-$k$) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the property of gradient distribution to propose an approximate top-$k$ selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Codes are available at: \url{https://github.com/hclhkbu/GaussianK-SGD}. △ Less

Submitted 20 November, 2019; originally announced November 2019.

Comments: 14 pages

arXiv:1907.04052 [pdf, other]

Improving Deep Lesion Detection Using 3D Contextual and Spatial Attention

Authors: Qingyi Tao, Zongyuan Ge, Jianfei Cai, Jianxiong Yin, Simon See

Abstract: Lesion detection from computed tomography (CT) scans is challenging compared to natural object detection because of two major reasons: small lesion size and small inter-class variation. Firstly, the lesions usually only occupy a small region in the CT image. The feature of such small region may not be able to provide sufficient information due to its limited spatial feature resolution. Secondly, i… ▽ More Lesion detection from computed tomography (CT) scans is challenging compared to natural object detection because of two major reasons: small lesion size and small inter-class variation. Firstly, the lesions usually only occupy a small region in the CT image. The feature of such small region may not be able to provide sufficient information due to its limited spatial feature resolution. Secondly, in CT scans, the lesions are often indistinguishable from the background since the lesion and non-lesion areas may have very similar appearances. To tackle both problems, we need to enrich the feature representation and improve the feature discriminativeness. Therefore, we introduce a dual-attention mechanism to the 3D contextual lesion detection framework, including the cross-slice contextual attention to selectively aggregate the information from different slices through a soft re-sampling process. Moreover, we propose intra-slice spatial attention to focus the feature learning in the most prominent regions. Our method can be easily trained end-to-end without adding heavy overhead on the base detection network. We use DeepLesion dataset and train a universal lesion detector to detect all kinds of lesions such as liver tumors, lung nodules, and so on. The results show that our model can significantly boost the results of the baseline lesion detector (with 3D contextual information) but using much fewer slices. △ Less

Submitted 9 July, 2019; originally announced July 2019.

Comments: Accepted by MICCAI 2019

arXiv:1810.04538 [pdf, other]

Secure Deep Learning Engineering: A Software Quality Assurance Perspective

Authors: Lei Ma, Felix Juefei-Xu, Minhui Xue, Qiang Hu, Sen Chen, Bo Li, Yang Liu, Jianjun Zhao, Jianxiong Yin, Simon See

Abstract: Over the past decades, deep learning (DL) systems have achieved tremendous success and gained great popularity in various applications, such as intelligent machines, image processing, speech processing, and medical diagnostics. Deep neural networks are the key driving force behind its recent success, but still seem to be a magic black box lacking interpretability and understanding. This brings up… ▽ More Over the past decades, deep learning (DL) systems have achieved tremendous success and gained great popularity in various applications, such as intelligent machines, image processing, speech processing, and medical diagnostics. Deep neural networks are the key driving force behind its recent success, but still seem to be a magic black box lacking interpretability and understanding. This brings up many open safety and security issues with enormous and urgent demands on rigorous methodologies and engineering practice for quality enhancement. A plethora of studies have shown that the state-of-the-art DL systems suffer from defects and vulnerabilities that can lead to severe loss and tragedies, especially when applied to real-world safety-critical applications. In this paper, we perform a large-scale study and construct a paper repository of 223 relevant works to the quality assurance, security, and interpretation of deep learning. We, from a software quality assurance perspective, pinpoint challenges and future opportunities towards universal secure deep learning engineering. We hope this work and the accompanied paper repository can pave the path for the software engineering community towards addressing the pressing industrial demand of secure intelligent applications. △ Less

Submitted 10 October, 2018; originally announced October 2018.

arXiv:1809.01266 [pdf, other]

DeepHunter: Hunting Deep Neural Network Defects via Coverage-Guided Fuzzing

Authors: Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Hongxu Chen, Minhui Xue, Bo Li, Yang Liu, Jianjun Zhao, Jianxiong Yin, Simon See

Abstract: In company with the data explosion over the past decade, deep neural network (DNN) based software has experienced unprecedented leap and is becoming the key driving force of many novel industrial applications, including many safety-critical scenarios such as autonomous driving. Despite great success achieved in various human intelligence tasks, similar to traditional software, DNNs could also exhi… ▽ More In company with the data explosion over the past decade, deep neural network (DNN) based software has experienced unprecedented leap and is becoming the key driving force of many novel industrial applications, including many safety-critical scenarios such as autonomous driving. Despite great success achieved in various human intelligence tasks, similar to traditional software, DNNs could also exhibit incorrect behaviors caused by hidden defects causing severe accidents and losses. In this paper, we propose DeepHunter, an automated fuzz testing framework for hunting potential defects of general-purpose DNNs. DeepHunter performs metamorphic mutation to generate new semantically preserved tests, and leverages multiple plugable coverage criteria as feedback to guide the test generation from different perspectives. To be scalable towards practical-sized DNNs, DeepHunter maintains multiple tests in a batch, and prioritizes the tests selection based on active feedback. The effectiveness of DeepHunter is extensively investigated on 3 popular datasets (MNIST, CIFAR-10, ImageNet) and 7 DNNs with diverse complexities, under a large set of 6 coverage criteria as feedback. The large-scale experiments demonstrate that DeepHunter can (1) significantly boost the coverage with guidance; (2) generate useful tests to detect erroneous behaviors and facilitate the DNN model quality evaluation; (3) accurately capture potential defects during DNN quantization for platform migration. △ Less

Submitted 16 November, 2018; v1 submitted 4 September, 2018; originally announced September 2018.

arXiv:1801.09335 [pdf, other]

Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks

Authors: Jason Kuen, Xiangfei Kong, Zhe Lin, Gang Wang, Jianxiong Yin, Simon See, Yap-Peng Tan

Abstract: It is desirable to train convolutional networks (CNNs) to run more efficiently during inference. In many cases however, the computational budget that the system has for inference cannot be known beforehand during training, or the inference budget is dependent on the changing real-time resource availability. Thus, it is inadequate to train just inference-efficient CNNs, whose inference costs are no… ▽ More It is desirable to train convolutional networks (CNNs) to run more efficiently during inference. In many cases however, the computational budget that the system has for inference cannot be known beforehand during training, or the inference budget is dependent on the changing real-time resource availability. Thus, it is inadequate to train just inference-efficient CNNs, whose inference costs are not adjustable and cannot adapt to varied inference budgets. We propose a novel approach for cost-adjustable inference in CNNs - Stochastic Downsampling Point (SDPoint). During training, SDPoint applies feature map downsampling to a random point in the layer hierarchy, with a random downsampling ratio. The different stochastic downsampling configurations known as SDPoint instances (of the same model) have computational costs different from each other, while being trained to minimize the same prediction loss. Sharing network parameters across different instances provides significant regularization boost. During inference, one may handpick a SDPoint instance that best fits the inference budget. The effectiveness of SDPoint, as both a cost-adjustable inference approach and a regularizer, is validated through extensive experiments on image classification. △ Less

Submitted 28 January, 2018; originally announced January 2018.

Showing 1–50 of 54 results for author: See, S