-
Exploring a Multimodal Fusion-based Deep Learning Network for Detecting Facial Palsy
Authors:
Nicole Heng Yim Oo,
Min Hun Lee,
Jeong Hoon Lim
Abstract:
Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessment by clinicians. In this paper, we present a multimodal fusion-based deep learning model that utilizes unstructured data (i.e. an image frame with facial line segments) and structured data (i.e. features of facial expressions) to detect facial palsy. We then contribute to a study to analyze the effect of different data modalities and the benefits of a multimodal fusion-based approach using videos of 21 facial palsy patients. Our experimental results show that among various data modalities (i.e. unstructured data - RGB images and images of facial line segments and structured data - coordinates of facial landmarks and features of facial expressions), the feed-forward neural network using features of facial expression achieved the highest precision of 76.22 while the ResNet-based model using images of facial line segments achieved the highest recall of 83.47. When we leveraged both images of facial line segments and features of facial expressions, our multimodal fusion-based deep learning model slightly improved the precision score to 77.05 at the expense of a decrease in the recall score.
Submitted 26 May, 2024;
originally announced May 2024.
-
Bridging the Intent Gap: Knowledge-Enhanced Visual Generation
Authors:
Yi Cheng,
Ziwei Xu,
Dongyun Lin,
Harry Cheng,
Yongkang Wong,
Ying Sun,
Joo Hwee Lim,
Mohan Kankanhalli
Abstract:
For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leading to a mismatch between the desired and generated output. Second, generative models trained on visual-label pairs lack the comprehensive knowledge to accurately represent all aspects of the input data in their generated outputs. To address these challenges, we propose a knowledge-enhanced iterative refinement framework for visual content generation. We begin by analyzing and identifying the key challenges faced by existing generative models. Then, we introduce various knowledge sources, including human insights, pre-trained models, logic rules, and world knowledge, which can be leveraged to address these challenges. Furthermore, we propose a novel visual generation framework that incorporates a knowledge-based feedback module to iteratively refine the generation process. This module gradually improves the alignment between the generated content and user intentions. We demonstrate the efficacy of the proposed framework through preliminary results, highlighting the potential of knowledge-enhanced generative models for intention-aligned content generation.
Submitted 21 May, 2024;
originally announced May 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention
Authors:
Burak Satar,
Hongyuan Zhu,
Hanwang Zhang,
Joo Hwee Lim
Abstract:
Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features on action recognition or temporal object co-occurrences on video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips, which is the first such attempt for a text-video retrieval task, to the best of our knowledge. We first hypothesise and verify how the bias affects the model, illustrated with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model surpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric, which proves the bias is mitigated, as well as on the other conventional metrics.
Submitted 17 September, 2023;
originally announced September 2023.
-
An Overview of Challenges in Egocentric Text-Video Retrieval
Authors:
Burak Satar,
Hongyuan Zhu,
Hanwang Zhang,
Joo Hwee Lim
Abstract:
Text-video retrieval contains various challenges, including biases coming from diverse sources. We highlight some of them supported by illustrations to open a discussion. Besides, we address one of the biases, frame length bias, with a simple method which brings a very incremental but promising increase. We conclude with future directions.
Submitted 7 June, 2023;
originally announced June 2023.
-
Adaptive Learning based Upper-Limb Rehabilitation Training System with Collaborative Robot
Authors:
Jun Hong Lim,
Kaibo He,
Zeji Yi,
Chen Hou,
Chen Zhang,
Yanan Sui,
Luming Li
Abstract:
Rehabilitation training for patients with motor disabilities usually requires specialized devices in rehabilitation centers. Home-based multi-purpose training would significantly increase treatment accessibility and reduce medical costs. While it is unlikely to equip a set of rehabilitation robots at home, we investigate the feasibility to use the general-purpose collaborative robot for rehabilitation therapies. In this work, we developed a new system for multi-purpose upper-limb rehabilitation training using a generic robot arm with human motor feedback and preference. We integrated surface electromyography, force/torque sensors, RGB-D cameras, and robot controllers with the Robot Operating System to enable sensing, communication, and control of the system. Imitation learning methods were adopted to imitate expert-provided training trajectories which could adapt to subject capabilities to facilitate in-home training. Our rehabilitation system is able to perform gross motor function and fine motor skill training with a gripper-based end-effector. We simulated system control in Gazebo and training effects (muscle activation level) in OpenSim and evaluated its real performance with human subjects. For all the subjects enrolled, our system achieved better training outcomes compared to specialist-assisted rehabilitation under the same conditions. Our work demonstrates the potential of utilizing collaborative robots for in-home motor rehabilitation training.
Submitted 12 July, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Score-based Diffusion Models in Function Space
Authors:
Jae Hyun Lim,
Nikola B. Kovachki,
Ricardo Baptista,
Christopher Beckham,
Kamyar Azizzadenesheli,
Jean Kossaifi,
Vikram Voleti,
Jiaming Song,
Karsten Kreis,
Jan Kautz,
Christopher Pal,
Arash Vahdat,
Anima Anandkumar
Abstract:
Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many domains where the data has a functional form such as in scientific computing and 3D geometric data analysis. In this work, we introduce a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space. In DDOs, the forward process perturbs input functions gradually using a Gaussian process. The generative process is formulated by integrating a function-valued Langevin dynamic. Our approach requires an appropriate notion of the score for the perturbed data distribution, which we obtain by generalizing denoising score matching to function spaces that can be infinite-dimensional. We show that the corresponding discretized algorithm generates accurate samples at a fixed cost that is independent of the data resolution. We theoretically and numerically verify the applicability of our approach on a set of problems, including generating solutions to the Navier-Stokes equation viewed as the push-forward distribution of forcings from a Gaussian Random Field (GRF).
Submitted 22 November, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Is Bio-Inspired Learning Better than Backprop? Benchmarking Bio Learning vs. Backprop
Authors:
Manas Gupta,
Sarthak Ketanbhai Modi,
Hang Zhang,
Joon Hei Lee,
Joo Hwee Lim
Abstract:
Bio-inspired learning has been gaining popularity recently given that Backpropagation (BP) is not considered biologically plausible. Many algorithms have been proposed in the literature which are all more biologically plausible than BP. However, apart from overcoming the biological implausibility of BP, a strong motivation for using Bio-inspired algorithms remains lacking. In this study, we undertake a holistic comparison of BP vs. multiple Bio-inspired algorithms to answer the question of whether Bio-learning offers additional benefits over BP. We test Bio-algorithms under different design choices such as access to only partial training data, resource constraints in terms of the number of training epochs, sparsification of the neural network parameters and addition of noise to input samples. Through these experiments, we notably find two key advantages of Bio-algorithms over BP. Firstly, Bio-algorithms perform much better than BP when the entire training dataset is not supplied. Four of the five Bio-algorithms tested outperform BP by up to 5% accuracy when only 20% of the training dataset is available. Secondly, even when the full dataset is available, Bio-algorithms learn much more quickly and converge to a stable accuracy in far fewer training epochs than BP. Hebbian learning, specifically, is able to learn in just 5 epochs compared to around 100 epochs required by BP. These insights present practical reasons for utilising Bio-learning beyond just their biological plausibility and also point towards interesting new directions for future work on Bio-learning.
Submitted 30 August, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Unveiling the Tapestry: the Interplay of Generalization and Forgetting in Continual Learning
Authors:
Zenglin Shi,
Jing Jie,
Ying Sun,
Joo Hwee Lim,
Mengmi Zhang
Abstract:
In AI, generalization refers to a model's ability to perform well on out-of-distribution data related to the given task, beyond the data it was trained on. For an AI agent to excel, it must also possess the continual learning capability, whereby an agent incrementally learns to perform a sequence of tasks without forgetting the previously acquired knowledge to solve the old tasks. Intuitively, generalization within a task allows the model to learn underlying features that can readily be applied to novel tasks, facilitating quicker learning and enhanced performance in subsequent tasks within a continual learning framework. Conversely, continual learning methods often include mechanisms to mitigate catastrophic forgetting, ensuring that knowledge from earlier tasks is retained. This preservation of knowledge over tasks plays a role in enhancing generalization for the ongoing task at hand. Despite the intuitive appeal of the interplay of both abilities, existing literature on continual learning and generalization has proceeded separately. In the preliminary effort to promote studies that bridge both fields, we first present empirical evidence showing that each of these fields has a mutually positive effect on the other. Next, building upon this finding, we introduce a simple and effective technique known as Shape-Texture Consistency Regularization (STCR), which caters to continual learning. STCR learns both shape and texture representations for each task, consequently enhancing generalization and thereby mitigating forgetting. Remarkably, extensive experiments validate that our STCR can be seamlessly integrated with existing continual learning methods, where its performance surpasses these continual learning methods in isolation or when combined with established generalization techniques by a large margin. Our data and source code will be made publicly available upon publication.
Submitted 17 January, 2024; v1 submitted 20 November, 2022;
originally announced November 2022.
-
Portmanteauing Features for Scene Text Recognition
Authors:
Yew Lee Tan,
Ernest Yu Kai Chew,
Adams Wai-Kin Kong,
Jung-Jae Kim,
Joo Hwee Lim
Abstract:
Scene text images have different shapes and are subjected to various distortions, e.g. perspective distortions. To handle these challenges, the state-of-the-art methods rely on a rectification network, which is connected to the text recognition network. They form a linear pipeline which uses text rectification on all input images, even for images that can be recognized without it. Undoubtedly, the rectification network improves the overall text recognition performance. However, in some cases, the rectification network generates unnecessary distortions on images, resulting in incorrect predictions in images that would have otherwise been correct without it. In order to alleviate the unnecessary distortions, the portmanteauing of features is proposed. The portmanteau feature, inspired by the portmanteau word, is a feature containing information from both the original text image and the rectified image. To generate the portmanteau feature, a non-linear input pipeline with a block matrix initialization is presented. In this work, the transformer is chosen as the recognition network due to its utilization of attention and inherent parallelism, which can effectively handle the portmanteau feature. The proposed method is examined on 6 benchmarks and compared with 13 state-of-the-art methods. The experimental results show that the proposed method outperforms the state-of-the-art methods on several of the benchmarks.
Submitted 9 November, 2022;
originally announced November 2022.
-
Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition
Authors:
Mei Chee Leong,
Haosong Zhang,
Hui Li Tan,
Liyuan Li,
Joo Hwee Lim
Abstract:
Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in spatial and temporal space, fine-grained action recognition model requires good temporal reasoning and discrimination of attribute action semantics. Leveraging on CNN's ability in capturing high level spatial-temporal feature representations and Transformer's modeling efficiency in capturing latent semantics and global dependencies, we investigate two frameworks that combine CNN vision backbone and Transformer Encoder to enhance fine-grained action recognition: 1) a vision-based encoder to learn latent temporal semantics, and 2) a multi-modal video-text cross encoder to exploit additional text input and learn cross association between visual and text semantics. Our experimental results show that both our Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality association, with improved recognition performance over CNN vision model. We achieve new state-of-the-art performance on the FineGym benchmark dataset for both proposed architectures.
Submitted 3 August, 2022;
originally announced August 2022.
-
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Authors:
Burak Satar,
Hongyuan Zhu,
Hanwang Zhang,
Joo Hwee Lim
Abstract:
In this report, we present our approach for the EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic roles corresponding to verbs and nouns; then utilize self-attentions to exploit semantic role contextualized video features along with textual features via triplet losses in multiple embedding spaces. Our method surpasses the strong baseline in normalized Discounted Cumulative Gain (nDCG), which is more valuable for semantic similarity. Our submission is ranked 3rd for nDCG and ranked 4th for mAP.
Submitted 26 September, 2023; v1 submitted 28 June, 2022;
originally announced June 2022.
-
Semantic Role Aware Correlation Transformer for Text to Video Retrieval
Authors:
Burak Satar,
Hongyuan Zhu,
Xavier Bresson,
Joo Hwee Lim
Abstract:
With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts with an attention scheme to learn the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. The preliminary results on the popular YouCook2 dataset indicate that our approach surpasses a current state-of-the-art method by a large margin in all metrics. It also surpasses two SOTA methods in terms of two metrics.
Submitted 26 June, 2022;
originally announced June 2022.
-
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
Authors:
Burak Satar,
Hongyuan Zhu,
Hanwang Zhang,
Joo Hwee Lim
Abstract:
Seas of videos are uploaded daily with the popularity of social channels; thus, retrieving the most related video contents with user textual queries plays a more crucial role. Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality. Some other approaches consider multiple embedding spaces consisting of global and local features separately, ignoring rich inter-modality correlations.
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels; the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels with mixture-of-experts for considering inter-modalities and structures' correlations. The results indicate that our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets, given the same visual backbone without pre-training. Finally, we conducted extensive ablation studies to elucidate our design choices.
Submitted 26 June, 2022;
originally announced June 2022.
-
FashionSearchNet-v2: Learning Attribute Representations with Localization for Image Retrieval with Attribute Manipulation
Authors:
Kenan E. Ak,
Joo Hwee Lim,
Ying Sun,
Jo Yew Tham,
Ashraf A. Kassim
Abstract:
The focus of this paper is on the problem of image retrieval with attribute manipulation. Our proposed work is able to manipulate the desired attributes of the query image while maintaining its other attributes. For example, the collar attribute of the query image can be changed from round to v-neck to retrieve similar images from a large dataset. A key challenge in e-commerce is that images have multiple attributes where users would like to manipulate and it is important to estimate discriminative feature representations for each of these attributes. The proposed FashionSearchNet-v2 architecture is able to learn attribute specific representations by leveraging on its weakly-supervised localization module, which ignores the unrelated features of attributes in the feature space, thus improving the similarity learning. The network is jointly trained with the combination of attribute classification and triplet ranking loss to estimate local representations. These local representations are then merged into a single global representation based on the instructed attribute manipulation where desired images can be retrieved with a distance metric. The proposed method also provides explainability for its retrieval process to help provide additional information on the attention of the network. Experiments performed on several datasets that are rich in terms of the number of attributes show that FashionSearchNet-v2 outperforms the other state-of-the-art attribute manipulation techniques. Different than our earlier work (FashionSearchNet), we propose several improvements in the learning procedure and show that the proposed FashionSearchNet-v2 can be generalized to different domains other than fashion.
Submitted 28 November, 2021;
originally announced November 2021.
-
Joint Learning On The Hierarchy Representation for Fine-Grained Human Action Recognition
Authors:
Mei Chee Leong,
Hui Li Tan,
Haosong Zhang,
Liyuan Li,
Feng Lin,
Joo Hwee Lim
Abstract:
Fine-grained human action recognition is a core research topic in computer vision. Inspired by the recently proposed hierarchy representation of fine-grained actions in FineGym and SlowFast network for action recognition, we propose a novel multi-task network which exploits the FineGym hierarchy representation to achieve effective joint learning and prediction for fine-grained human action recognition. The multi-task network consists of three pathways of SlowOnly networks with gradually increased frame rates for events, sets and elements of fine-grained actions, followed by our proposed integration layers for joint learning and prediction. It is a two-stage approach, where it first learns deep feature representation at each hierarchical level, and is followed by feature encoding and fusion for multi-task learning. Our empirical results on the FineGym dataset achieve a new state-of-the-art performance, with 91.80% Top-1 accuracy and 88.46% mean accuracy for element actions, which are 3.40% and 7.26% higher than the previous best results.
Submitted 12 October, 2021;
originally announced October 2021.
-
A Variational Perspective on Diffusion-Based Generative Models and Score Matching
Authors:
Chin-Wei Huang,
Jae Hyun Lim,
Aaron Courville
Abstract:
Discrete-time diffusion-based generative models and score matching methods have shown promising results in modeling high-dimensional image data. Recently, Song et al. (2021) show that diffusion processes that transform data into noise can be reversed via learning the score function, i.e. the gradient of the log-density of the perturbed data. They propose to plug the learned score function into an inverse formula to define a generative diffusion process. Despite the empirical success, a theoretical underpinning of this procedure is still lacking. In this work, we approach the (continuous-time) generative diffusion directly and derive a variational framework for likelihood estimation, which includes continuous-time normalizing flows as a special case, and can be seen as an infinitely deep variational autoencoder. Under this framework, we show that minimizing the score-matching loss is equivalent to maximizing a lower bound of the likelihood of the plug-in reverse SDE proposed by Song et al. (2021), bridging the theoretical gap.
Submitted 29 September, 2021; v1 submitted 5 June, 2021;
originally announced June 2021.
-
A Data-driven Event Generator for Hadron Colliders using Wasserstein Generative Adversarial Network
Authors:
Suyong Choi,
Jae Hoon Lim
Abstract:
Highly reliable Monte Carlo event generators and detector simulation programs are important for precision measurements in high-energy physics. Huge amounts of computing resources are required to produce a sufficient number of simulated events. Moreover, simulation parameters have to be fine-tuned to reproduce the conditions of high-energy particle interactions, which is not trivial in some regions of phase space of physics interest. In this paper, we suggest a new method based on the Wasserstein Generative Adversarial Network (WGAN) that can learn the probability distribution of the real data. Our method is capable of generating events in a very short computing time compared to traditional MC generators. The trained WGAN is able to reproduce the shape of the real data with high fidelity.
Submitted 23 February, 2021;
originally announced February 2021.
-
Protect, Show, Attend and Tell: Empowering Image Captioning Models with Ownership Protection
Authors:
Jian Han Lim,
Chee Seng Chan,
Kam Woh Ng,
Lixin Fan,
Qiang Yang
Abstract:
By and large, existing Intellectual Property (IP) protection methods for deep neural networks typically i) focus on the image classification task only, and ii) follow a standard digital watermarking framework that was conventionally used to protect the ownership of multimedia and video content. This paper demonstrates that the current digital watermarking framework is insufficient to protect image captioning tasks, which are often regarded as one of the frontier AI problems. As a remedy, this paper studies and proposes two different embedding schemes in the hidden memory state of a recurrent neural network to protect the image captioning model. Empirically, we show that a forged key will yield an unusable image captioning model, defeating the purpose of infringement. To the best of our knowledge, this work is the first to propose ownership protection for the image captioning task. Also, extensive experiments show that the proposed method does not compromise the original image captioning performance on all common captioning metrics on the Flickr30k and MS-COCO datasets, and at the same time it is able to withstand both removal and ambiguity attacks. Code is available at https://github.com/jianhanlim/ipr-imagecaptioning
Submitted 31 August, 2021; v1 submitted 25 August, 2020;
originally announced August 2020.
-
AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation
Authors:
Jae Hyun Lim,
Aaron Courville,
Christopher Pal,
Chin-Wei Huang
Abstract:
Entropy is ubiquitous in machine learning, but it is in general intractable to compute the entropy of the distribution of an arbitrary continuous random variable. In this paper, we propose the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy. Amortization allows us to significantly reduce the error of the gradient approximator by approaching asymptotic optimality of a regular DAE, in which case the estimation is in theory unbiased. We conduct theoretical and experimental analyses on the approximation error of the proposed method, as well as extensive studies on heuristics to ensure its robustness. Finally, using the proposed gradient approximator to estimate the gradient of entropy, we demonstrate state-of-the-art performance on density estimation with variational autoencoders and continuous control with soft actor-critic.
Submitted 9 June, 2020;
originally announced June 2020.
-
Development of Linear Astigmatism Free -- Three Mirror System (LAF-TMS)
Authors:
Woojin Park,
Seunghyuk Chang,
Jae Hyuk Lim,
Sunwoo Lee,
Hojae Ahn,
Yunjong Kim,
Sanghyuk Kim,
Arvid Hammar,
Byeongjoon Jeong,
Geon Hee Kim,
Hyoungkwon Lee,
Dae Wook Kim,
Soojong Pak
Abstract:
We present the development of Linear Astigmatism Free - Three Mirror System (LAF-TMS). This is a prototype of an off-axis telescope that enables very wide field of view (FoV) infrared satellites that can observe Paschen-$α$ emission, zodiacal light, integrated star light, and other infrared sources. It has the entrance pupil diameter of 150 mm, the focal length of 500 mm, and the FoV of 5.5$^\circ$ $\times$ 4.1$^\circ$. LAF-TMS is an obscuration-free off-axis system with minimal out-of-field baffling and no optical support structure diffraction. This optical design is analytically optimized to remove linear astigmatism and to reduce high-order aberrations. Sensitivity analysis and Monte-Carlo simulation reveal that tilt errors are the most sensitive alignment parameters that allow $\sim$1$^\prime$. Optomechanical structure accurately mounts aluminum mirrors, and withstands satellite-level vibration environments. LAF-TMS shows optical performance with 37 $μ$m FWHM of the point source image satisfying Nyquist sampling requirements for typical 18 $μ$m pixel Infrared array detectors. The surface figure errors of mirrors and scattered light from the tertiary mirror with 4.9 nm surface micro roughness may affect the measured point spread function (PSF). Optical tests successfully demonstrate constant optical performance over wide FoV, indicating that LAF-TMS suppresses linear astigmatism and high-order aberrations.
Submitted 13 February, 2020;
originally announced February 2020.
-
Neural Multisensory Scene Inference
Authors:
Jae Hyun Lim,
Pedro O. Pinheiro,
Negar Rostamzadeh,
Christopher Pal,
Sungjin Ahn
Abstract:
For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve the computational efficiency and the robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE).
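To illustrate how a product of experts can combine evidence from several modalities, here is a minimal sketch of the standard product of 1D Gaussian experts: precisions add, and the combined mean is the precision-weighted average. This is the plain textbook rule, not the amortized variant the abstract proposes, and the function name and the (mean, variance) input format are assumptions made for illustration.

```python
def product_of_gaussian_experts(experts):
    """Combine 1D Gaussian experts given as (mean, variance) pairs.

    Returns the (mean, variance) of the normalized product density.
    """
    # Precisions (inverse variances) of independent Gaussian experts add up.
    total_precision = sum(1.0 / variance for _, variance in experts)
    # The combined mean weights each expert's mean by its precision.
    combined_mean = sum(mean / variance for mean, variance in experts) / total_precision
    return combined_mean, 1.0 / total_precision
```

With this rule a modality can be dropped simply by leaving its expert out of the list, which is one reason product-of-experts fusion suits arbitrary combinations of modalities.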
Submitted 7 November, 2019; v1 submitted 5 October, 2019;
originally announced October 2019.
-
Variational Prototype Replays for Continual Learning
Authors:
Mengmi Zhang,
Tao Wang,
Joo Hwee Lim,
Gabriel Kreiman,
Jiashi Feng
Abstract:
Continual learning refers to the ability to acquire and transfer knowledge without catastrophically forgetting what was previously learned. In this work, we consider \emph{few-shot} continual learning in classification tasks, and we propose a novel method, Variational Prototype Replays, that efficiently consolidates and recalls previous knowledge to avoid catastrophic forgetting. In each classification task, our method learns a set of variational prototypes with their means and variances, where embedding of the samples from the same class can be represented in a prototypical distribution and class-representative prototypes are separated apart. To alleviate catastrophic forgetting, our method replays one sample per class from previous tasks, and correspondingly matches newly predicted embeddings to their nearest class-representative prototypes stored from previous tasks. Compared with recent continual learning approaches, our method can readily adapt to new tasks with more classes without requiring the addition of new units. Furthermore, our method is more memory efficient since only class-representative prototypes with their means and variances, as well as only one sample per class from previous tasks need to be stored. Without tampering with the performance on initial tasks, our method learns novel concepts given a few training examples of each class in new tasks.
Submitted 15 February, 2020; v1 submitted 22 May, 2019;
originally announced May 2019.
-
Egocentric Spatial Memory
Authors:
Mengmi Zhang,
Keng Teck Ma,
Shih-Cheng Yen,
Joo Hwee Lim,
Qi Zhao,
Jiashi Feng
Abstract:
Egocentric spatial memory (ESM) defines a memory system with encoding, storing, recognizing and recalling the spatial information about the environment from an egocentric perspective. We introduce an integrated deep neural network architecture for modeling ESM. It learns to estimate the occupancy state of the world and progressively construct top-down 2D global maps from egocentric views in a spatially extended environment. During the exploration, our proposed ESM model updates its belief about the global map based on local observations using a recurrent neural network. It also augments the local mapping with a novel external memory to encode and store latent representations of the visited places over long-term exploration in large environments, which enables agents to perform place recognition and hence loop closure. Our proposed ESM network contributes in the following aspects: (1) without feature engineering, our model predicts free space based on egocentric views efficiently in an end-to-end manner; (2) different from other deep learning-based mapping systems, ESMN deals with continuous actions and states, which is vitally important for robotic control in real applications. In the experiments, we demonstrate its accurate and robust global mapping capacities in 3D virtual mazes and realistic indoor environments by comparison with several competitive baselines.
Submitted 31 July, 2018;
originally announced July 2018.
-
Finding any Waldo: zero-shot invariant and efficient visual search
Authors:
Mengmi Zhang,
Jiashi Feng,
Keng Teck Ma,
Joo Hwee Lim,
Qi Zhao,
Gabriel Kreiman
Abstract:
Searching for a target object in a cluttered scene constitutes a fundamental challenge in daily vision. Visual search must be selective enough to discriminate the target from distractors, invariant to changes in the appearance of the target, efficient to avoid exhaustive exploration of the image, and must generalize to locate novel target objects with zero-shot training. Previous work has focused on searching for perfect matches of a target after extensive category-specific training. Here we show for the first time that humans can efficiently and invariantly search for natural objects in complex scenes. To gain insight into the mechanisms that guide visual search, we propose a biologically inspired computational model that can locate targets without exhaustive sampling and generalize to novel objects. The model provides an approximation to the mechanisms integrating bottom-up and top-down signals during search in natural scenes.
Submitted 17 July, 2018;
originally announced July 2018.
-
High Rate RPC detector for LHC
Authors:
F. Lagarde,
A. Fagot,
M. Gul,
C. Roskas,
M. Tytgat,
N. Zaganidis,
S. Fonseca De Souza,
A. Santoro,
F. Torres Da Silva De Araujo,
A. Aleksandrov,
R. Hadjiiska,
P. Iaydjiev,
M. Rodozov,
M. Shopova,
G. Sultanov,
A. Dimitrov,
L. Litov,
B. Pavlov,
P. Petkov,
A. Petrov,
S. J. Qian,
D. Han,
W. Yi,
C. Avila,
A. Cabrera
, et al. (77 additional authors not shown)
Abstract:
The High Luminosity LHC (HL-LHC) phase is designed to increase by an order of magnitude the amount of data to be collected by the LHC experiments. The foreseen gradual increase of the instantaneous luminosity of up to more than twice its nominal value of $10\times10^{34}\ {\rm cm}^{-2}{\rm s}^{-1}$ during Phase I and Phase II of the LHC running presents special challenges for the experiments. The high pseudorapidity ($η$) region of the forward muon spectrometer ($2.4 > |η| > 1.9$) is not equipped with RPC stations. The increase of the expected particle rate up to 2 kHz cm$^{-2}$ (including a safety factor of 3) motivates the installation of RPC chambers to guarantee redundancy with the CSC chambers already present. The current CMS RPC technology cannot sustain the expected background level. A new generation of Glass-RPC (GRPC) using low-resistivity glass was proposed to equip the two farthest of the four high-$η$ muon stations of CMS. In their single-gap version they can withstand rates of a few kHz cm$^{-2}$. Their time precision of about 1 ns makes it possible to reduce the noise contribution, leading to an improvement of the trigger rate. The proposed design for large-size chambers is examined, and some preliminary results obtained during beam tests at the Gamma Irradiation Facility (GIF++) and the Super Proton Synchrotron (SPS) at CERN are shown. The tests were performed to validate the capability of such detectors to withstand a high-irradiation environment with limited consequences for their efficiency.
Submitted 16 July, 2018;
originally announced July 2018.
-
Geometric GAN
Authors:
Jae Hyun Lim,
Jong Chul Ye
Abstract:
Generative Adversarial Nets (GANs) represent an important milestone for effective generative models, which has inspired numerous variants seemingly different from each other. One of the main contributions of this paper is to reveal a unified geometric structure in GAN and its variants. Specifically, we show that the adversarial generative model training can be decomposed into three geometric steps: separating hyperplane search, discriminator parameter update away from the separating hyperplane, and the generator update along the normal vector direction of the separating hyperplane. This geometric intuition reveals the limitations of the existing approaches and leads us to propose a new formulation called geometric GAN using SVM separating hyperplane that maximizes the margin. Our theoretical analysis shows that the geometric GAN converges to a Nash equilibrium between the discriminator and generator. In addition, extensive numerical results show the superior performance of geometric GAN.
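The SVM-style margin objective mentioned above is commonly written as a hinge loss on the discriminator's scores. The sketch below shows that widely used hinge form; the abstract does not spell out the loss, so treating it as the paper's exact objective is an assumption, as are the function names and the use of plain score lists in place of network outputs.

```python
def hinge(value):
    # Standard hinge: zero once the margin is satisfied.
    return max(0.0, value)


def discriminator_hinge_loss(real_scores, fake_scores):
    # Penalize real samples scored below +1 and fake samples scored above -1.
    real_term = sum(hinge(1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(hinge(1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    # The generator pushes fake samples toward the real side of the hyperplane
    # by raising their average discriminator score.
    return -sum(fake_scores) / len(fake_scores)
```

Samples that already lie outside the margin contribute zero to the discriminator loss, which mirrors how only support vectors determine an SVM's separating hyperplane.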
Submitted 8 May, 2017; v1 submitted 8 May, 2017;
originally announced May 2017.
-
R&D towards the CMS RPC Phase-2 upgrade
Authors:
A. Fagot,
A. Cimmino,
S. Crucy,
M. Gul,
A. A. O. Rios,
M. Tytgat,
N. Zaganidis,
S. Aly,
Y. Assran,
A. Radi,
A. Sayed,
G. Singh,
M. Abbrescia,
G. Iaselli,
M. Maggi,
G. Pugliese,
P. Verwilligen,
W. Van Doninck,
S. Colafranceschi,
A. Sharma,
L. Benussi,
S. Bianco,
D. Piccolo,
F. Primavera,
V. Bhatnagar
, et al. (71 additional authors not shown)
Abstract:
The high pseudo-rapidity region of the CMS muon system is covered by Cathode Strip Chambers (CSC) only and lacks redundant coverage despite the fact that it is a challenging region for muons in terms of backgrounds and momentum resolution. In order to maintain good efficiency for the muon trigger in this region additional RPCs are planned to be installed in the two outermost stations at low angle named RE3/1 and RE4/1. These stations will use RPCs with finer granularity and good timing resolution to mitigate background effects and to increase the redundancy of the system.
Submitted 14 June, 2016;
originally announced June 2016.
-
High rate, fast timing Glass RPC for the high η CMS muon detectors
Authors:
F. Lagarde,
M. Gouzevitch,
I. Laktineh,
V. Buridon,
X. Chen,
C. Combaret,
A. Eynard,
L. Germani,
G. Grenier,
H. Mathez,
L. Mirabito,
A. Petrukhin,
A. Steen,
W. Tromeur,
Y. Wang,
A. Gong,
N. Moreau,
C. de la Taille,
F. Dulucq,
A. Cimmino,
S. Crucy,
A. Fagot,
M. Gul,
A. A. O. Rios,
M. Tytgat
, et al. (86 additional authors not shown)
Abstract:
The HL-LHC phase is designed to increase by an order of magnitude the amount of data to be collected by the LHC experiments. To achieve this goal on a reasonable time scale, the instantaneous luminosity would also increase by an order of magnitude, up to $6\times10^{34}$ cm$^{-2}$ s$^{-1}$. The region of the forward muon spectrometer ($|η| > 1.6$) is not equipped with RPC stations. The increase of the expected particle rate up to 2 kHz/cm$^{2}$ (including a safety factor of 3) motivates the installation of RPC chambers to guarantee redundancy with the CSC chambers already present. The current RPC technology of CMS cannot sustain the expected background level. The new technology that will be chosen should have a high rate capability and provide good spatial and timing resolution. A new generation of Glass-RPC (GRPC) using low-resistivity (LR) glass is proposed to equip at least the two farthest of the four high-$η$ muon stations of CMS. First, the design of small-size prototypes and studies of their performance in high-rate particle flux are presented. Then the proposed designs for large-size chambers and their fast-timing electronic readout are examined and preliminary results are provided.
Submitted 22 July, 2016; v1 submitted 4 June, 2016;
originally announced June 2016.
-
Performance of Resistive Plate Chambers installed during the first long shutdown of the CMS experiment
Authors:
M. Shopova,
A. Aleksandrov,
R. Hadjiiska,
P. Iaydjiev,
G. Sultanov,
M. Rodozov,
S. Stoykova,
Y. Assran,
A. Sayed,
A. Radi,
S. Aly,
G. Singh,
M. Abbrescia,
G. Iaselli,
M. Maggi,
G. Pugliese,
P. Verwilligen,
W. Van Doninck,
S. Colafranceschi,
A. Sharma,
L. Benussi,
S. Bianco,
D. Piccolo,
F. Primavera,
A. Cimmino
, et al. (71 additional authors not shown)
Abstract:
The CMS experiment, located at the CERN Large Hadron Collider, has a redundant muon system composed of three different detector technologies: Cathode Strip Chambers (in the forward regions), Drift Tubes (in the central region) and Resistive Plate Chambers (in both the central and forward regions). All three are used for muon reconstruction and triggering. During the first long shutdown (LS1) of the LHC (2013-2014) the CMS muon system was upgraded with 144 newly installed RPCs on the fourth forward stations. The new chambers ensure and enhance the muon trigger efficiency in the high luminosity conditions of the LHC Run2. The chambers have been successfully installed and commissioned. The system has been run successfully and experimental data has been collected and analyzed. The performance results of the newly installed RPCs will be presented.
Submitted 22 May, 2016;
originally announced May 2016.
-
Graphene-protein bioelectronic devices with wavelength-dependent photoresponse
Authors:
Ye Lu,
Mitchell B. Lerner,
Zhengqing John Qi,
Joseph J. Mitala,
Jong Hsien Lim,
Bohdana M. Discher,
A. T. Charlie Johnson Jr
Abstract:
We implemented a nanoelectronic interface between graphene field effect transistors (FETs) and soluble proteins. This enables production of bioelectronic devices that combine functionalities of the biomolecular and inorganic components. The method serves to link polyhistidine-tagged proteins to graphene FETs using the tag itself. Atomic Force Microscopy and Raman spectroscopy provide structural understanding of the bio/nano hybrid; current-gate voltage measurements are used to elucidate the electronic properties. As an example application, we functionalize graphene FETs with fluorescent proteins to yield hybrids that respond to light at wavelengths defined by the optical absorption spectrum of the protein.
Submitted 28 December, 2011;
originally announced December 2011.