-
Observation Time Difference: an Online Dynamic Objects Removal Method for Ground Vehicles
Authors:
Rongguang Wu,
Chenglin Pang,
Xuankang Wu,
Zheng Fang
Abstract:
In urban environment mapping, the sequential accumulation of dynamic objects leaves a large number of traces in the map. These traces usually degrade the localization accuracy and navigation performance of the robot. Therefore, dynamic object removal plays an important role in creating a clean map. However, conventional dynamic object removal methods usually run offline: the map is reprocessed after it is constructed, which adds time cost. To tackle this problem, this paper proposes a novel method for online dynamic object removal for ground vehicles. According to the observation time difference between an object and the ground where it is located, dynamic objects are classified into two types: suddenly appearing and suddenly disappearing. For these two kinds of dynamic objects, we propose downward retrieval and upward retrieval methods, respectively, to eliminate them. We validate our method on the SemanticKITTI dataset and an author-collected dataset with highly dynamic objects. Compared with other state-of-the-art methods, our method is more efficient and robust, and reduces the running time per frame by more than 60% on average.
Submitted 22 June, 2024;
originally announced June 2024.
-
Abelian Group Codes for Classical and Classical-Quantum Channels: One-shot and Asymptotic Rate Bounds
Authors:
James Chin-Jen Pang,
Sandeep Pradhan,
Hessam Mahdavifar
Abstract:
We study the problem of transmission of information over classical and classical-quantum channels in the one-shot regime where the underlying codes are constrained to be group codes. In the achievability part, we introduce a new input probability distribution that incorporates the encoding homomorphism and the underlying channel law. Using a random coding argument, we characterize the performance of group codes in terms of hypothesis testing relative-entropic quantities. In the converse part, we establish bounds by leveraging a hypothesis testing-based approach. Furthermore, we apply the one-shot result to the asymptotic stationary memoryless setting, and establish a single-letter lower bound on group capacities for both classes of channels. Moreover, we derive a matching upper bound on the asymptotic group capacity.
Submitted 19 June, 2024;
originally announced June 2024.
-
Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases
Authors:
Meng Wang,
Tian Lin,
Aidi Lin,
Kai Yu,
Yuanyuan Peng,
Lianyu Wang,
Cheng Chen,
Ke Zou,
Huiyu Liang,
Man Chen,
Xue Yao,
Meiqin Zhang,
Binwei Huang,
Chaoxin Zheng,
Peixin Zhang,
Wei Chen,
Yilong Luo,
Yifan Chen,
Honghe Xia,
Tingkun Shi,
Qi Zhang,
Jinming Guo,
Xiaolin Chen,
Jingcheng Wang,
Yih Chung Tham
, et al. (24 additional authors not shown)
Abstract:
Previous foundation models for retinal images were pre-trained with limited disease categories and a limited knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. For RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered.
Submitted 30 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Uncovering Limitations of Large Language Models in Information Seeking from Tables
Authors:
Chaoxu Pang,
Yixuan Cao,
Chunhao Yang,
Ping Luo
Abstract:
Tables are recognized for their high information density and widespread usage, serving as essential sources of information. Seeking information from tables (TIS) is a crucial capability for Large Language Models (LLMs), serving as the foundation of knowledge-based Q&A systems. However, this field presently suffers from an absence of thorough and reliable evaluation. This paper introduces a more reliable benchmark for Table Information Seeking (TabIS). To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format. We establish an effective pipeline for generating options, ensuring their difficulty and quality. Experiments conducted on 12 LLMs reveal that while the performance of GPT-4-turbo is marginally satisfactory, both other proprietary and open-source models perform inadequately. Further analysis shows that LLMs exhibit a poor understanding of table structures, and struggle to balance between TIS performance and robustness against pseudo-relevant tables (common in retrieval-augmented systems). These findings uncover the limitations and potential challenges of LLMs in seeking information from tables. We release our data and code to facilitate further research in this field.
Submitted 6 June, 2024;
originally announced June 2024.
-
TANQ: An open domain dataset of table answered questions
Authors:
Mubashara Akhtar,
Chenxi Pang,
Andreea Marzoca,
Yasemin Altun,
Julian Martin Eisenschlos
Abstract:
Language models, potentially augmented with tool usage such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ, the first open domain question answering dataset where the answers require building tables from information across multiple sources. We release the full source attribution for every cell in the resulting table and benchmark state-of-the-art language models in open, oracle, and closed book setups. Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points. We analyse baselines' performance across different dataset attributes, such as the different skills required for this task, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failures in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.
Submitted 13 May, 2024;
originally announced May 2024.
-
H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model
Authors:
Chao Pang,
Jiang Wu,
Jiayu Li,
Yi Liu,
Jiaxing Sun,
Weijia Li,
Xingxing Weng,
Shuai Wang,
Litong Feng,
Gui-Song Xia,
Conghui He
Abstract:
Generic large Vision-Language Models (VLMs) are developing rapidly but still perform poorly in the Remote Sensing (RS) domain, owing to the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote Sensing specific Vision Language Models (RSVLMs) still have considerable potential for improvement, primarily owing to the lack of large-scale, high-quality RS vision-language datasets. We constructed HqDC-1.4M, a large-scale dataset of High-quality and Detailed Captions for RS images containing 1.4 million image-caption pairs, which not only enhances the RSVLM's understanding of RS images but also significantly improves the model's spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable "hallucination" problem in RSVLMs, we developed RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness and reduces the hallucinations of the model's outputs, thereby enhancing the honesty of the RSVLM. Based on these datasets, we propose H2RSVLM, the Helpful and Honest Remote Sensing Vision Language Model. H2RSVLM achieves outstanding performance on multiple public RS datasets and is capable of recognizing and refusing to answer unanswerable questions, effectively mitigating incorrect generations. We will release the code, data and model weights at https://github.com/opendatalab/H2RSVLM .
Submitted 29 March, 2024;
originally announced March 2024.
-
Understanding the Role of Pathways in a Deep Neural Network
Authors:
Lei Lyu,
Chen Pang,
Jihua Wang
Abstract:
Deep neural networks have demonstrated superior performance in artificial intelligence applications, but the opaqueness of their inner working mechanism is one major drawback in their application. The prevailing unit-based interpretation is a statistical observation of stimulus-response data, which fails to show a detailed internal process of inherent mechanisms of neural networks. In this work, we analyze a convolutional neural network (CNN) trained in the classification task and present an algorithm to extract the diffusion pathways of individual pixels to identify the locations of pixels in an input image associated with object classes. The pathways allow us to test the causal components which are important for classification and the pathway-based representations are clearly distinguishable between categories. We find that the few largest pathways of an individual pixel from an image tend to cross the feature maps in each layer that is important for classification. And the large pathways of images of the same category are more consistent in their trends than those of different categories. We also apply the pathways to understanding adversarial attacks, object completion, and movement perception. Further, the total number of pathways on feature maps in all layers can clearly discriminate the original, deformed, and target samples.
Submitted 28 February, 2024;
originally announced February 2024.
-
RadarMOSEVE: A Spatial-Temporal Transformer Network for Radar-Only Moving Object Segmentation and Ego-Velocity Estimation
Authors:
Changsong Pang,
Xieyuanli Chen,
Yimin Liu,
Huimin Lu,
Yuwei Cheng
Abstract:
Moving object segmentation (MOS) and Ego velocity estimation (EVE) are vital capabilities for mobile systems to achieve full autonomy. Several approaches have attempted to achieve MOSEVE using a LiDAR sensor. However, LiDAR sensors are typically expensive and susceptible to adverse weather conditions. Instead, millimeter-wave radar (MWR) has gained popularity in robotics and autonomous driving for real applications due to its cost-effectiveness and resilience to bad weather. Nonetheless, publicly available MOSEVE datasets and approaches using radar data are limited. Some existing methods adopt point convolutional networks from LiDAR-based approaches, ignoring the specific artifacts and the valuable radial velocity information of radar measurements, leading to suboptimal performance. In this paper, we propose a novel transformer network that effectively addresses the sparsity and noise issues and leverages the radial velocity measurements of radar points using our devised radar self- and cross-attention mechanisms. Based on that, our method achieves accurate EVE of the robot and performs MOS using only radar data simultaneously. To thoroughly evaluate the MOSEVE performance of our method, we annotated the radar points in the public View-of-Delft (VoD) dataset and additionally constructed a new radar dataset in various environments. The experimental results demonstrate the superiority of our approach over existing state-of-the-art methods. The code is available at https://github.com/ORCA-Uboat/RadarMOSEVE.
Submitted 22 February, 2024;
originally announced February 2024.
-
CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines
Authors:
Chao Pang,
Xinzhuo Jiang,
Nishanth Parameshwar Pavinkurve,
Krishna S. Kalluri,
Elise L. Minto,
Jason Patterson,
Linying Zhang,
George Hripcsak,
Gamze Gürsoy,
Noémie Elhadad,
Karthik Natarajan
Abstract:
Synthetic Electronic Health Records (EHR) have emerged as a pivotal tool in advancing healthcare applications and machine learning models, particularly for researchers without direct access to healthcare data. Although existing methods, like rule-based approaches and generative adversarial networks (GANs), generate synthetic data that resembles real-world EHR data, these methods often use a tabular format, disregarding temporal dependencies in patient histories and limiting data replication. Recently, there has been a growing interest in leveraging Generative Pre-trained Transformers (GPT) for EHR data. This enables applications like disease progression analysis, population estimation, counterfactual reasoning, and synthetic data generation. In this work, we focus on synthetic data generation and demonstrate the capability of training a GPT model using a particular patient representation derived from CEHR-BERT, enabling us to generate patient sequences that can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.
Submitted 5 May, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
HiCD: Change Detection in Quality-Varied Images via Hierarchical Correlation Distillation
Authors:
Chao Pang,
Xingxing Weng,
Jiang Wu,
Qiang Wang,
Gui-Song Xia
Abstract:
Advanced change detection techniques primarily target image pairs of equal and high quality. However, variations in imaging conditions and platforms frequently lead to image pairs with distinct qualities: one image being high-quality, while the other being low-quality. These disparities in image quality present significant challenges for understanding image pairs semantically and extracting change features, ultimately resulting in a notable decline in performance. To tackle this challenge, we introduce an innovative training strategy grounded in knowledge distillation. The core idea revolves around leveraging task knowledge acquired from high-quality image pairs to guide the model's learning process when dealing with image pairs that exhibit differences in quality. Additionally, we develop a hierarchical correlation distillation approach (involving self-correlation, cross-correlation, and global correlation). This approach compels the student model to replicate the correlations inherent in the teacher model, rather than focusing solely on individual features. This ensures effective knowledge transfer while maintaining the student model's training flexibility.
Submitted 19 January, 2024;
originally announced January 2024.
-
Aurora:Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning
Authors:
Rongsheng Wang,
Haoming Chen,
Ruizhe Zhou,
Yaofei Duan,
Kunyan Cai,
Han Ma,
Jiaxi Cui,
Jian Li,
Patrick Cheong-Iao Pang,
Yapeng Wang,
Tao Tan
Abstract:
Existing research has demonstrated that refining large language models (LLMs) through the utilization of machine-generated instruction-following data empowers these models to exhibit impressive zero-shot capabilities for novel tasks, without requiring human-authored instructions. In this paper, we systematically investigate, preprocess, and integrate three Chinese instruction-following datasets with the aim of enhancing the Chinese conversational capabilities of Mixtral-8x7B sparse Mixture-of-Experts model. Through instruction fine-tuning on this carefully processed dataset, we successfully construct the Mixtral-8x7B sparse Mixture-of-Experts model named "Aurora." To assess the performance of Aurora, we utilize three widely recognized benchmark tests: C-Eval, MMLU, and CMMLU. Empirical studies validate the effectiveness of instruction fine-tuning applied to Mixtral-8x7B sparse Mixture-of-Experts model. This work is pioneering in the execution of instruction fine-tuning on a sparse expert-mixed model, marking a significant breakthrough in enhancing the capabilities of this model architecture. Our code, data and model are publicly available at
https://github.com/WangRongsheng/Aurora
Submitted 1 January, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Learning to Denoise Unreliable Interactions for Link Prediction on Biomedical Knowledge Graph
Authors:
Tengfei Ma,
Yujie Chen,
Wen Tao,
Dashun Zheng,
Xuan Lin,
Patrick Cheong-Iao Pang,
Yiping Liu,
Yijun Wang,
Bosheng Song,
Xiangxiang Zeng
Abstract:
Link prediction in biomedical knowledge graphs (KGs) aims at predicting unknown interactions between entities, including drug-target interaction (DTI) and drug-drug interaction (DDI), which is critical for drug discovery and therapeutics. Previous methods prefer to utilize the rich semantic relations and topological structure of the KG to predict missing links, yielding promising outcomes. However, all these works only focus on improving the predictive performance without considering the inevitable noise and unreliable interactions existing in the KGs, which limits the development of KG-based computational methods. To address these limitations, we propose a Denoised Link Prediction framework, called DenoisedLP. DenoisedLP obtains reliable interactions based on the local subgraph by denoising noisy links in a learnable way, providing a universal module for mining underlying task-relevant relations. To collaborate with the smoothed semantic information, DenoisedLP introduces the semantic subgraph by blurring conflict relations around the predicted link. By maximizing the mutual information between the reliable structure and smoothed semantic relations, DenoisedLP emphasizes the informative interactions for predicting relation-specific links. Experimental results on real-world datasets demonstrate that DenoisedLP outperforms state-of-the-art methods on DTI and DDI prediction tasks, and verify the effectiveness and robustness of denoising unreliable interactions on the contaminated KGs.
Submitted 9 December, 2023;
originally announced December 2023.
-
Improving the Knowledge Gradient Algorithm
Authors:
Yang Le,
Gao Siyang,
Ho Chin Pang
Abstract:
The knowledge gradient (KG) algorithm is a popular policy for the best arm identification (BAI) problem. It is built on the simple idea of always choosing the measurement that yields the greatest expected one-step improvement in the estimate of the best mean of the arms. In this research, we show that this policy has limitations that prevent the algorithm from being asymptotically optimal. We then provide a remedy that keeps KG's one-step look-ahead but instead chooses the measurement that yields the greatest one-step improvement in the probability of selecting the best arm. The new policy is called improved knowledge gradient (iKG). iKG can be shown to be asymptotically optimal. In addition, we show that, compared to KG, iKG is easier to extend to variant problems of BAI, with $ε$-good arm identification and feasible arm identification as two examples. The superior performance of iKG on these problems is further demonstrated using numerical examples.
Submitted 27 October, 2023;
originally announced October 2023.
-
Guideline Learning for In-context Information Extraction
Authors:
Chaoxu Pang,
Yixuan Cao,
Qiang Ding,
Ping Luo
Abstract:
Large language models (LLMs) can perform a new task by merely conditioning on task instructions and a few input-output examples, without optimizing any parameters. This is called In-Context Learning (ICL). In-context Information Extraction (IE) has recently garnered attention in the research community. However, the performance of In-context IE generally lags behind the state-of-the-art supervised expert models. We highlight a key reason for this shortfall: underspecified task description. The limited-length context struggles to thoroughly express the intricate IE task instructions and various edge cases, leading to misalignment in task comprehension with humans. In this paper, we propose a Guideline Learning (GL) framework for In-context IE which reflectively learns and follows guidelines. During the learning phase, GL automatically synthesizes a set of guidelines based on a few error cases, and during inference, GL retrieves helpful guidelines for better ICL. Moreover, we propose a self-consistency-based active learning method to enhance the efficiency of GL. Experiments on event extraction and relation extraction show that GL can significantly improve the performance of in-context IE.
Submitted 21 October, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity
Authors:
Hao Shi,
Chengshan Pang,
Jiaming Zhang,
Kailun Yang,
Yuhao Wu,
Huajian Ni,
Yining Lin,
Rainer Stiefelhagen,
Kaiwei Wang
Abstract:
Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems, which extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. While previous studies have limitations in using only depth or height information, we find both depth and height matter and they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature is primarily focused on distinguishing between various categories of height intervals, essentially providing semantic context. This insight motivates the development of Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distribution and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also seamlessly integrated to further enhance the detection accuracy from the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset, demonstrating that CoBEV not only achieves the accuracy of the new state-of-the-art, but also significantly advances the robustness of previous methods in challenging long-distance scenarios and noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in terms of easy mode. The source code will be made publicly available at https://github.com/MasterHow/CoBEV.
Submitted 17 October, 2023; v1 submitted 4 October, 2023;
originally announced October 2023.
-
IvyGPT: InteractiVe Chinese pathwaY language model in medical domain
Authors:
Rongsheng Wang,
Yaofei Duan,
ChanTong Lam,
Jiexi Chen,
Jiangsheng Xu,
Haoming Chen,
Xiaohong Liu,
Patrick Cheong-Iao Pang,
Tao Tan
Abstract:
General large language models (LLMs) such as ChatGPT have shown remarkable success. However, such LLMs have not been widely adopted for medical purposes, due to poor accuracy and inability to provide medical advice. We propose IvyGPT, an LLM based on LLaMA that is trained and fine-tuned with high-quality medical question-answer (QA) instances and Reinforcement Learning from Human Feedback (RLHF). After supervised fine-tuning, IvyGPT has good multi-turn conversation capabilities, but it cannot perform like a doctor in other aspects, such as comprehensive diagnosis. Through RLHF, IvyGPT can output richer diagnosis and treatment answers that are closer to human. In the training, we used QLoRA to train 33 billion parameters on a small number of NVIDIA A100 (80GB) GPUs. Experimental results show that IvyGPT has outperformed other medical GPT models.
Submitted 19 July, 2023;
originally announced July 2023.
-
Configurable Spatial-Temporal Hierarchical Analysis for Flexible Video Anomaly Detection
Authors:
Kai Cheng,
Xinhua Zeng,
Yang Liu,
Tian Wang,
Chengxin Pang,
Jing Teng,
Zhaoyang Xia,
Jing Liu
Abstract:
Video anomaly detection (VAD) is a vital task with great practical applications in industrial surveillance, security systems, and traffic control. Unlike previous unsupervised VAD methods that adopt a fixed structure to learn normality without considering different detection demands, we design a spatial-temporal hierarchical architecture (STHA) as a configurable architecture to flexibly detect different degrees of anomaly. The comprehensive structure of the STHA is delineated into a tripartite hierarchy, encompassing the following tiers: the stream level, the stack level, and the block level. Specifically, we design several auto-encoder-based blocks that possess varying capacities for extracting normal patterns. Then, we stack blocks according to the complexity degrees with both intra-stack and inter-stack residual links to learn hierarchical normality gradually. Considering the multisource knowledge of videos, we also model the spatial normality of video frames and temporal normality of RGB difference by designing two parallel streams consisting of stacks. Thus, STHA can provide various representation learning abilities by expanding or contracting hierarchically to detect anomalies of different degrees. Since the anomaly set is complicated and unbounded, our STHA can adjust its detection ability to adapt to the human detection demands and the complexity degree of anomaly that happened in the history of a scene. We conduct experiments on three benchmarks and perform extensive analysis, and the results demonstrate that our method performs comparably to the state-of-the-art methods. In addition, we design a toy dataset to prove that our model can better balance the learning ability to adapt to different detection demands.
Submitted 12 May, 2023;
originally announced May 2023.
-
Uncertainty-inspired Open Set Learning for Retinal Anomaly Identification
Authors:
Meng Wang,
Tian Lin,
Lianyu Wang,
Aidi Lin,
Ke Zou,
Xinxing Xu,
Yi Zhou,
Yuanyuan Peng,
Qingquan Meng,
Yiming Qian,
Guoyao Deng,
Zhiqun Wu,
Junhong Chen,
Jianhong Lin,
Mingzhi Zhang,
Weifang Zhu,
Changqing Zhang,
Daoqiang Zhang,
Rick Siow Mong Goh,
Yong Liu,
Chi Pui Pang,
Xinjian Chen,
Haoyu Chen,
Huazhu Fu
Abstract:
Failure to recognize samples from the classes unseen during training is a major limitation of artificial intelligence in the real-world implementation for recognition and classification of retinal anomalies. We established an uncertainty-inspired open-set (UIOS) model, which was trained with fundus images of 9 retinal conditions. Besides assessing the probability of each category, UIOS also calculated an uncertainty score to express its confidence. Our UIOS model with thresholding strategy achieved an F1 score of 99.55%, 97.01% and 91.91% for the internal testing set, external target categories (TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1 score of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS correctly predicted high uncertainty scores, which would prompt the need for a manual check in the datasets of non-target categories retinal diseases, low-quality fundus images, and non-fundus images. UIOS provides a robust method for real-world screening of retinal anomalies.
Submitted 29 August, 2023; v1 submitted 8 April, 2023;
originally announced April 2023.
-
StoryChat: Designing a Narrative-Based Viewer Participation Tool for Live Streaming Chatrooms
Authors:
Ryan Yen,
Li Feng,
Brinda Mehra,
Ching Christie Pang,
Siying Hu,
Zhicong Lu
Abstract:
Live streaming platforms and existing viewer participation tools enable users to interact and engage with an online community, but the anonymity and scale of chat usually result in the spread of negative comments. However, only a few existing moderation tools investigate the influence of proactive moderation on viewers' engagement and prosocial behavior. To address this, we developed StoryChat, a narrative-based viewer participation tool that utilizes a dynamic graphical plot to reflect chatroom negativity. We crafted the narrative through a viewer-centered (N=65) iterative design process and evaluated the tool with 48 experienced viewers in a deployment study. We discovered that StoryChat encouraged viewers to contribute prosocial comments, increased viewer engagement, and fostered viewers' sense of community. Viewers reported a closer connection between streamers and other viewers because of the narrative design, suggesting that narrative-based viewer engagement tools have the potential to encourage community engagement and prosocial behaviors.
Submitted 7 April, 2023;
originally announced April 2023.
-
Capacity-achieving Polar-based Codes with Sparsity Constraints on the Generator Matrices
Authors:
James Chin-Jen Pang,
Hessam Mahdavifar,
S. Sandeep Pradhan
Abstract:
In this paper, we leverage polar codes and the well-established channel polarization to design capacity-achieving codes with a certain constraint on the weights of all the columns in the generator matrix (GM) while having a low-complexity decoding algorithm. We first show that given a binary-input memoryless symmetric (BMS) channel $W$ and a constant $s \in (0, 1]$, there exists a polarization kernel such that the corresponding polar code is capacity-achieving with the \textit{rate of polarization} $s/2$, and the GM column weights being bounded from above by $N^s$. To improve the sparsity versus error rate trade-off, we devise a column-splitting algorithm and two coding schemes for BEC and then for general BMS channels. The \textit{polar-based} codes generated by the two schemes inherit several fundamental properties of polar codes with the original $2 \times 2$ kernel including the decay in error probability, decoding complexity, and the capacity-achieving property. Furthermore, they demonstrate the additional property that their GM column weights are bounded from above sublinearly in $N$, while the original polar codes have some column weights that are linear in $N$. In particular, for any BEC and $β<0.5$, the existence of a sequence of capacity-achieving polar-based codes where all the GM column weights are bounded from above by $N^λ$ with $λ\approx 0.585$, and with the error probability bounded by $O(2^{-N^β} )$ under a decoder with complexity $O(N\log N)$, is shown. The existence of similar capacity-achieving polar-based codes with the same decoding complexity is shown for any BMS channel and $β<0.5$ with $λ\approx 0.631$.
Submitted 16 March, 2023;
originally announced March 2023.
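Illustrative note on the abstract above: the remark that the original polar codes have some generator-matrix column weights linear in $N$ can be checked numerically. The sketch below is not the authors' construction or their column-splitting algorithm. It only builds the $n$-fold Kronecker power of the standard $2 \times 2$ kernel, omitting the bit-reversal permutation (which reorders the matrix but does not change the set of column weights), and reports the heaviest column. The function names are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def polar_generator(n):
    """n-fold Kronecker power of the 2x2 kernel F = [[1, 0], [1, 1]].

    Assumption: the bit-reversal permutation is omitted, since it only
    reorders the matrix and leaves the multiset of column weights unchanged.
    """
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        G = kron(G, F)
    return G


def column_weights(G):
    """Number of ones in each column of a 0/1 matrix."""
    return [sum(col) for col in zip(*G)]


if __name__ == "__main__":
    for n in range(1, 7):
        N = 2 ** n
        heaviest = max(column_weights(polar_generator(n)))
        # The first column is all ones, so the heaviest column has weight N.
        print(f"N = {N:3d}, max column weight = {heaviest}")
```

Because each column weight of the Kronecker power is a product of the kernel's column weights (2 and 1), the heaviest column always has weight $2^n = N$. This is the linear growth that the polar-based codes in the abstract are designed to avoid.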
-
Harms from Increasingly Agentic Algorithmic Systems
Authors:
Alan Chan,
Rebecca Salganik,
Alva Markelius,
Chris Pang,
Nitarshan Rajkumar,
Dmitrii Krasheninnikov,
Lauro Langosco,
Zhonghao He,
Yawen Duan,
Micah Carroll,
Michelle Lin,
Alex Mayhew,
Katherine Collins,
Maryam Molamohammadi,
John Burden,
Wanru Zhao,
Shalaleh Rismani,
Konstantinos Voudouris,
Umang Bhatt,
Adrian Weller,
David Krueger,
Tegan Maharaj
Abstract:
Research in Fairness, Accountability, Transparency, and Ethics (FATE) has established many sources and forms of algorithmic harm, in domains as diverse as health care, finance, policing, and recommendations. Much work remains to be done to mitigate the serious harms of these systems, particularly those disproportionately affecting marginalized communities. Despite these ongoing harms, new systems are being developed and deployed which threaten the perpetuation of the same harms and the creation of novel ones. In response, the FATE community has emphasized the importance of anticipating harms. Our work focuses on the anticipation of harms from increasingly agentic systems. Rather than providing a definition of agency as a binary property, we identify 4 key characteristics which, particularly in combination, tend to increase the agency of a given algorithmic system: underspecification, directness of impact, goal-directedness, and long-term planning. We also discuss important harms which arise from increasing agency -- notably, these include systemic and/or long-range impacts, often on marginalized stakeholders. We emphasize that recognizing agency of algorithmic systems does not absolve or shift the human responsibility for algorithmic harms. Rather, we use the term agency to highlight the increasingly evident fact that ML systems are not fully under human control. Our work explores increasingly agentic algorithmic systems in three parts. First, we explain the notion of an increase in agency for algorithmic systems in the context of diverse perspectives on agency across disciplines. Second, we argue for the need to anticipate harms from increasingly agentic systems. Third, we discuss important harms from increasingly agentic systems and ways forward for addressing them. We conclude by reflecting on implications of our work for anticipating algorithmic harms from emerging systems.
Submitted 11 May, 2023; v1 submitted 20 February, 2023;
originally announced February 2023.
-
ASDF: A Differential Testing Framework for Automatic Speech Recognition Systems
Authors:
Daniel Hao Xian Yuen,
Andrew Yong Chen Pang,
Zhou Yang,
Chun Yong Chong,
Mei Kuan Lim,
David Lo
Abstract:
Recent years have witnessed wider adoption of Automated Speech Recognition (ASR) techniques in various domains. Consequently, evaluating and enhancing the quality of ASR systems is of great importance. This paper proposes ASDF, an Automated Speech Recognition Differential Testing Framework for testing ASR systems. ASDF extends an existing ASR testing tool, CrossASR++, which synthesizes test cases from a text corpus. However, CrossASR++ fails to make use of the text corpus efficiently and provides limited information on how the failed test cases can improve ASR systems. To address these limitations, our tool incorporates two novel features: (1) a text transformation module to boost the number of generated test cases and uncover more errors in ASR systems and (2) a phonetic analysis module to identify on which phonemes the ASR system tends to produce errors. ASDF generates more high-quality test cases by applying various text transformation methods (e.g., change tense) to the texts in failed test cases. By doing so, ASDF can utilize a small text corpus to generate a large number of audio test cases, something which CrossASR++ is not capable of. In addition, ASDF implements more metrics to evaluate the performance of ASR systems from multiple perspectives. ASDF performs phonetic analysis on the identified failed test cases to identify the phonemes that ASR systems tend to transcribe incorrectly, providing useful information for developers to improve ASR systems. The demonstration video of our tool is available online at https://www.youtube.com/watch?v=DzVwfc3h9As. The implementation is available at https://github.com/danielyuenhx/asdf-differential-testing.
Submitted 10 February, 2023;
originally announced February 2023.
-
ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
Authors:
Pengfei Zhu,
Chao Pang,
Yekun Chai,
Lei Li,
Shuohuan Wang,
Yu Sun,
Hao Tian,
Hua Wu
Abstract:
In recent years, the burgeoning interest in diffusion models has led to significant advances in image and speech generation. Nevertheless, the direct synthesis of music waveforms from unrestricted textual prompts remains a relatively underexplored domain. In response to this lacuna, this paper introduces a pioneering contribution in the form of a text-to-waveform music generation model, underpinned by the utilization of diffusion models. Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process within the diffusion model framework. Addressing the challenge of limited text-music parallel data, we undertake the creation of a dataset by harnessing web resources, a task facilitated by weak supervision techniques. Furthermore, a rigorous empirical inquiry is undertaken to contrast the efficacy of two distinct prompt formats for text conditioning, namely, music tags and unconstrained textual descriptions. The outcomes of this comparative analysis affirm the superior performance of our proposed model in terms of enhancing text-music relevance. Finally, our work culminates in a demonstrative exhibition of the excellent capabilities of our model in text-to-music generation. We further demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
Submitted 21 September, 2023; v1 submitted 9 February, 2023;
originally announced February 2023.
-
Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks
Authors:
Chen Pang,
Xuequan Lu,
Lei Lyu
Abstract:
For pursuing accurate skeleton-based action recognition, most prior methods use the strategy of combining Graph Convolution Networks (GCNs) with attention-based methods in a serial way. However, they regard the human skeleton as a complete graph, resulting in fewer variations between different actions (e.g., the connection between the elbow and head in action ``clapping hands''). For this, we propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in a parallel way. The ConGT involves two parallel streams: Spatial-Temporal Graph Convolution stream (STG) and Spatial-Temporal Transformer stream (STT). The STG is designed to obtain action representations maintaining the natural topology structure of the human skeleton. The STT is devised to acquire action representations containing the global relationships among joints. Since the action representations produced from these two streams contain different characteristics, and each of them knows little information of the other, we introduce the contrastive learning paradigm to guide their output representations of the same sample to be as close as possible in a self-supervised manner. Through the contrastive learning, they can learn information from each other to enrich the action features by maximizing the mutual information between the two types of action representations. To further improve action recognition accuracy, we introduce the Cyclical Focal Loss (CFL) which can focus on confident training samples in early training epochs, with an increasing focus on hard samples during the middle epochs. We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.
Submitted 26 January, 2023;
originally announced January 2023.
-
Detecting Building Changes with Off-Nadir Aerial Images
Authors:
Chao Pang,
Jiang Wu,
Jian Ding,
Can Song,
Gui-Song Xia
Abstract:
The tilted viewing nature of the off-nadir aerial images brings severe challenges to the building change detection (BCD) problem: the mismatch of the nearby buildings and the semantic ambiguity of the building facades. To tackle these challenges, we present a multi-task guided change detection network model, named as MTGCD-Net. The proposed model approaches the specific BCD problem by designing three auxiliary tasks, including: (1) a pixel-wise classification task to predict the roofs and facades of buildings; (2) an auxiliary task for learning the roof-to-footprint offsets of each building to account for the misalignment between building roof instances; and (3) an auxiliary task for learning the identical roof matching flow between bi-temporal aerial images to tackle the building roof mismatch problem. These auxiliary tasks provide indispensable and complementary building parsing and matching information. The predictions of the auxiliary tasks are finally fused to the main building change detection branch with a multi-modal distillation module. To train and test models for the BCD problem with off-nadir aerial images, we create a new benchmark dataset, named BANDON. Extensive experiments demonstrate that our model achieves superior performance over the previous state-of-the-art competitors.
Submitted 25 January, 2023;
originally announced January 2023.
-
DePlot: One-shot visual language reasoning by plot-to-table translation
Authors:
Fangyu Liu,
Julian Martin Eisenschlos,
Francesco Piccinno,
Syrine Krichene,
Chenxi Pang,
Kenton Lee,
Mandar Joshi,
Wenhu Chen,
Nigel Collier,
Yasemin Altun
Abstract:
Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.
Submitted 23 May, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
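Illustrative note on the abstract above: the second step of the two-step decomposition (reasoning over a linearized table with a one-shot prompt) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not DePlot's actual output format or prompt template. The " | " cell separator, the "Table:/Question:/Answer:" layout, and both function names are hypothetical, and the plot-to-table model and the LLM call are left out entirely.

```python
def linearize_table(header, rows):
    """Flatten a table into plain text, one row per line.

    Assumption: cells are joined with " | "; the real DePlot format may differ.
    """
    lines = [" | ".join(str(cell) for cell in header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def build_one_shot_prompt(example_table, example_question, example_answer,
                          table, question):
    """Assemble a one-shot prompt: one solved example, then the new query."""
    return (
        f"Table:\n{example_table}\n"
        f"Question: {example_question}\n"
        f"Answer: {example_answer}\n\n"
        f"Table:\n{table}\n"
        f"Question: {question}\n"
        f"Answer:"
    )


if __name__ == "__main__":
    # Made-up tables standing in for the output of a plot-to-table model.
    demo = linearize_table(["Year", "Sales"], [[2020, 5], [2021, 7]])
    query = linearize_table(["City", "Rainfall"], [["A", 30], ["B", 45]])
    prompt = build_one_shot_prompt(
        demo, "Which year had higher sales?", "2021",
        query, "Which city had more rainfall?",
    )
    print(prompt)  # This string would be sent to a pretrained LLM.
```

Keeping the table as plain text is what makes the plug-and-play use described in the abstract possible: any text-only LLM can consume the prompt without being trained on images.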
-
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
Authors:
Fangyu Liu,
Francesco Piccinno,
Syrine Krichene,
Chenxi Pang,
Kenton Lee,
Mandar Joshi,
Yasemin Altun,
Nigel Collier,
Julian Martin Eisenschlos
Abstract:
Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning which are the key capabilities in visual language modeling.
We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
Submitted 23 May, 2023; v1 submitted 19 December, 2022;
originally announced December 2022.
-
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
Authors:
Yekun Chai,
Shuohuan Wang,
Chao Pang,
Yu Sun,
Hao Tian,
Hua Wu
Abstract:
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
Submitted 19 May, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Multi-view deep learning based molecule design and structural optimization accelerates the SARS-CoV-2 inhibitor discovery
Authors:
Chao Pang,
Yu Wang,
Yi Jiang,
Ruheng Wang,
Ran Su,
Leyi Wei
Abstract:
In this work, we propose MEDICO, a Multi-viEw Deep generative model for molecule generation, structural optimization, and the SARS-CoV-2 Inhibitor disCOvery. To the best of our knowledge, MEDICO is the first-of-this-kind graph generative model that can generate molecular graphs similar to the structure of targeted molecules, with a multi-view representation learning framework to sufficiently and adaptively learn comprehensive structural semantics from targeted molecular topology and geometry. We show that our MEDICO significantly outperforms the state-of-the-art methods in generating valid, unique, and novel molecules under benchmarking comparisons. In particular, we showcase the multi-view deep learning model enables us to generate not only the molecules structurally similar to the targeted molecules but also the molecules with desired chemical properties, demonstrating the strong capability of our model in exploring the chemical space deeply. Moreover, case study results on targeted molecule generation for the SARS-CoV-2 main protease (Mpro) show that by integrating molecule docking into our model as chemical priori, we successfully generate new small molecules with desired drug-like properties for the Mpro, potentially accelerating the de novo design of Covid-19 drugs. Further, we apply MEDICO to the structural optimization of three well-known Mpro inhibitors (N3, 11a, and GC376) and achieve ~88% improvement in their binding affinity to Mpro, demonstrating the application value of our model for the development of therapeutics for SARS-CoV-2 infection.
Submitted 3 December, 2022;
originally announced December 2022.
-
LGN-Net: Local-Global Normality Network for Video Anomaly Detection
Authors:
Mengyang Zhao,
Xinhua Zeng,
Yang Liu,
Jing Liu,
Di Li,
Xing Hu,
Chengxin Pang
Abstract:
Video anomaly detection (VAD) has been intensively studied for years because of its potential applications in intelligent video systems. Existing unsupervised VAD methods tend to learn normality from training sets consisting of only normal videos and regard instances deviating from such normality as anomalies. However, they often consider only local or global normality in the temporal dimension. Some of them focus on learning local spatiotemporal representations from consecutive frames to enhance the representation for normal events. But powerful representation allows these methods to represent some anomalies and causes miss detection. In contrast, the other methods are devoted to memorizing prototypical normal patterns of whole training videos to weaken the generalization for anomalies, which also restricts them from representing diverse normal patterns and causes false alarm. To this end, we propose a two-branch model, Local-Global Normality Network (LGN-Net), to simultaneously learn local and global normality. Specifically, one branch learns the evolution regularities of appearance and motion from consecutive frames as local normality utilizing a spatiotemporal prediction network, while the other branch memorizes prototype features of the whole videos as global normality by a memory module. LGN-Net achieves a balance of representing normal and abnormal instances by fusing local and global normality. In addition, the fused normality enables LGN-Net to generalize to various scenes more than exploiting single normality. Experiments demonstrate the effectiveness and superior performance of our method. The code is available online: https://github.com/Myzhao1999/LGN-Net.
Submitted 8 January, 2023; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report
Authors:
Andrey Ignatov,
Radu Timofte,
Shuai Liu,
Chaoyu Feng,
Furui Bai,
Xiaotao Wang,
Lei Lei,
Ziyao Yi,
Yan Xiang,
Zibin Liu,
Shaoqing Li,
Keming Shi,
Dehui Kong,
Ke Xu,
Minsu Kwon,
Yaqi Wu,
Jiesi Zheng,
Zhihao Fan,
Xun Wu,
Feng Zhang,
Albert No,
Minhyeok Cho,
Zewen Chen,
Xiaze Zhang,
Ran Li
, et al. (13 additional authors not shown)
Abstract:
The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon's 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20-50 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
Submitted 7 November, 2022;
originally announced November 2022.
-
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech
Authors:
Xiaoran Fan,
Chao Pang,
Tian Yuan,
He Bai,
Renjie Zheng,
Pengfei Zhu,
Shuohuan Wang,
Junkun Chen,
Zeyu Chen,
Liang Huang,
Yu Sun,
Hua Wu
Abstract:
Speech representation learning has improved both speech understanding and speech synthesis tasks for single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes given a speech example and its transcription. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both the training and the inference without any finetuning effort. In cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing tasks, our experiments show that our model outperforms speaker-embedding-based multi-speaker TTS methods.
Submitted 4 December, 2022; v1 submitted 7 November, 2022;
originally announced November 2022.
-
MechRetro is a chemical-mechanism-driven graph learning framework for interpretable retrosynthesis prediction and pathway planning
Authors:
Yu Wang,
Chao Pang,
Yuzhe Wang,
Yi Jiang,
Junru Jin,
Sirui Liang,
Quan Zou,
Leyi Wei
Abstract:
Leveraging artificial intelligence for automatic retrosynthesis speeds up organic pathway planning in digital laboratories. However, existing deep learning approaches are unexplainable, like "black box" with few insights, notably limiting their applications in real retrosynthesis scenarios. Here, we propose MechRetro, a chemical-mechanism-driven graph learning framework for interpretable retrosynthetic prediction and pathway planning, which learns several retrosynthetic actions to simulate a reverse reaction via elaborate self-adaptive joint learning. By integrating chemical knowledge as prior information, we design a novel Graph Transformer architecture to adaptively learn discriminative and chemically meaningful molecule representations, highlighting the strong capacity in molecule feature representation learning. We demonstrate that MechRetro outperforms the state-of-the-art approaches for retrosynthetic prediction with a large margin on large-scale benchmark datasets. Extending MechRetro to the multi-step retrosynthesis analysis, we identify efficient synthetic routes via an interpretable reasoning mechanism, leading to a better understanding in the realm of knowledgeable synthetic chemists. We also showcase that MechRetro discovers a novel pathway for protokylol, along with energy scores for uncertainty assessment, broadening the applicability for practical scenarios. Overall, we expect MechRetro to provide meaningful insights for high-throughput automated organic synthesis in drug discovery.
Submitted 5 October, 2022;
originally announced October 2022.
-
Viko 2.0: A Hierarchical Gecko-inspired Adhesive Gripper with Visuotactile Sensor
Authors:
Chohei Pang,
Qicheng Wang,
Kinwing Mak,
Hongyu Yu,
Michael Yu Wang
Abstract:
Robotic grippers with visuotactile sensors have access to rich tactile information for grasping tasks but encounter difficulty in partially encompassing large objects with sufficient grip force. While hierarchical gecko-inspired adhesives are a potential technique for bridging performance gaps, they require a large contact area for efficient usage. In this work, we present a new version of an adaptive gecko gripper called Viko 2.0 that effectively combines the advantage of adhesives and visuotactile sensors. Compared with a non-hierarchical structure, a hierarchical structure with a multimaterial design achieves approximately a 1.5 times increase in normal adhesion and double in contact area. The integrated visuotactile sensor captures a deformation image of the hierarchical structure and provides a real-time measurement of contact area, shear force, and incipient slip detection at 24 Hz. The gripper is implemented on a robotic arm to demonstrate an adaptive grasping pose based on contact area, and grasps objects with a wide range of geometries and textures.
Submitted 21 April, 2022;
originally announced April 2022.
-
ChildPredictor: A Child Face Prediction Framework with Disentangled Learning
Authors:
Yuzhi Zhao,
Lai-Man Po,
Xuehui Wang,
Qiong Yan,
Wei Shen,
Yujia Zhang,
Wei Liu,
Chun-Kit Wong,
Chiu-Sing Pang,
Weifeng Ou,
Wing-Yin Yu,
Buhua Liu
Abstract:
The appearances of children are inherited from their parents, which makes it feasible to predict them. Predicting realistic children's faces may help settle many social problems, such as age-invariant face recognition, kinship verification, and missing child identification. It can be regarded as an image-to-image translation task. Existing approaches usually assume domain information in the image-to-image translation can be interpreted by "style", i.e., the separation of image content and style. However, such separation is improper for the child face prediction, because the facial contours between children and parents are not the same. To address this issue, we propose a new disentangled learning strategy for children's face prediction. We assume that children's faces are determined by genetic factors (compact family features, e.g., face contour), external factors (facial attributes irrelevant to prediction, such as moustaches and glasses), and variety factors (individual properties for each child). On this basis, we formulate predictions as a mapping from parents' genetic factors to children's genetic factors, and disentangle them from external and variety factors. In order to obtain accurate genetic factors and perform the mapping, we propose a ChildPredictor framework. It transfers human faces to genetic factors by encoders and back by generators. Then, it learns the relationship between the genetic factors of parents and children through a mapping function. To ensure the generated faces are realistic, we collect a large Family Face Database to train ChildPredictor and evaluate it on the FF-Database validation set. Experimental results demonstrate that ChildPredictor is superior to other well-known image-to-image translation methods in predicting realistic and diverse child faces. Implementation codes can be found at https://github.com/zhaoyuzhi/ChildPredictor.
Submitted 21 April, 2022;
originally announced April 2022.
-
New Bounds on the Size of Binary Codes with Large Minimum Distance
Authors:
James Chin-Jen Pang,
Hessam Mahdavifar,
S. Sandeep Pradhan
Abstract:
Let $A(n, d)$ denote the maximum size of a binary code of length $n$ and minimum Hamming distance $d$. Studying $A(n, d)$, including efforts to determine it as well to derive bounds on $A(n, d)$ for large $n$'s, is one of the most fundamental subjects in coding theory. In this paper, we explore new lower and upper bounds on $A(n, d)$ in the large-minimum distance regime, in particular, when $d = n/2 - Ω(\sqrt{n})$. We first provide a new construction of cyclic codes, by carefully selecting specific roots in the binary extension field for the check polynomial, with length $n= 2^m -1$, distance $d \geq n/2 - 2^{c-1}\sqrt{n}$, and size $n^{c+1/2}$, for any $m\geq 4$ and any integer $c$ with $0 \leq c \leq m/2 - 1$. These code parameters are slightly worse than those of the Delsarte--Goethals (DG) codes that provide the previously known best lower bound in the large-minimum distance regime. However, using a similar and extended code construction technique we show a sequence of cyclic codes that improve upon DG codes and provide the best lower bound in a narrower range of the minimum distance $d$, in particular, when $d = n/2 - Ω(n^{2/3})$. Furthermore, by leveraging a Fourier-analytic view of Delsarte's linear program, upper bounds on $A(n, n/2 - ρ\sqrt{n})$ with $ρ\in (0.5, 9.5)$ are obtained that scale polynomially in $n$. To the best of authors' knowledge, the upper bound due to Barg and Nogin \cite{barg2006spectral} is the only previously known upper bound that scale polynomially in $n$ in this regime. We numerically demonstrate that our upper bound improves upon the Barg-Nogin upper bound in the specified high-minimum distance regime.
Submitted 23 May, 2023; v1 submitted 7 February, 2022;
originally announced February 2022.
-
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Authors:
Shuohuan Wang,
Yu Sun,
Yang Xiang,
Zhihua Wu,
Siyu Ding,
Weibao Gong,
Shikun Feng,
Junyuan Shang,
Yanbin Zhao,
Chao Pang,
Jiaxiang Liu,
Xuyi Chen,
Yuxiang Lu,
Weixin Liu,
Xi Wang,
Yangfan Bai,
Qiuliang Chen,
Li Zhao,
Shiyong Li,
Peng Sun,
Dianhai Yu,
Yanjun Ma,
Hao Tian,
Hua Wu,
Tian Wu
, et al. (4 additional authors not shown)
Abstract:
Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the computation overhead and carbon emission, we propose an online distillation framework for ERNIE 3.0 Titan, where the teacher model will teach students and train itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that the ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.
Submitted 23 December, 2021;
originally announced December 2021.
-
CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks
Authors:
Chao Pang,
Xinzhuo Jiang,
Krishna S Kalluri,
Matthew Spotnitz,
RuiJun Chen,
Adler Perotte,
Karthik Natarajan
Abstract:
Embedding algorithms are increasingly used to represent clinical concepts in healthcare for improving machine learning tasks such as clinical phenotyping and disease prediction. Recent studies have adapted state-of-the-art bidirectional encoder representations from transformers (BERT) architecture to structured electronic health records (EHR) data for the generation of contextualized concept embeddings, yet do not fully incorporate temporal data across multiple clinical domains. Therefore we developed a new BERT adaptation, CEHR-BERT, to incorporate temporal information using a hybrid approach by augmenting the input to BERT using artificial time tokens, incorporating time, age, and concept embeddings, and introducing a new second learning objective for visit type. CEHR-BERT was trained on a subset of Columbia University Irving Medical Center-York Presbyterian Hospital's clinical data, which includes 2.4M patients, spanning over three decades, and tested using 4-fold cross-validation on the following prediction tasks: hospitalization, death, new heart failure (HF) diagnosis, and HF readmission. Our experiments show that CEHR-BERT outperformed existing state-of-the-art clinical BERT adaptations and baseline models across all 4 prediction tasks in both ROC-AUC and PR-AUC. CEHR-BERT also demonstrated strong transfer learning capability, as our model trained on only 5% of data outperformed comparison models trained on the entire data set. Ablation studies to better understand the contribution of each time component showed incremental gains with every element, suggesting that CEHR-BERT's incorporation of artificial time tokens, time and age embeddings with concept embeddings, and the addition of the second learning objective represents a promising approach for future BERT-based clinical embeddings.
Submitted 10 November, 2021;
originally announced November 2021.
-
Facilitating Parallel Fuzzing with mutually-exclusive Task Distribution
Authors:
Yifan Wang,
Yuchen Zhang,
Chengbin Pang,
Peng Li,
Nikolaos Triandopoulos,
Jun Xu
Abstract:
Fuzz testing, or fuzzing, has become one of the de facto standard techniques for bug finding in the software industry. In general, fuzzing provides various inputs to the target program to discover unhandled exceptions and crashes. In business sectors where the time budget is limited, software vendors often launch many fuzzing instances in parallel as common means of increasing code coverage. However, most of the popular fuzzing tools in their parallel mode-naively run multiple instances concurrently, without elaborate distribution of workload. This can lead different instances to explore overlapped code regions, eventually reducing the benefits of concurrency. In this paper, we propose a general model to describe parallel fuzzing. This model distributes mutually-exclusive but similarly-weighted tasks to different instances, facilitating concurrency and also fairness across instances. Following this model, we develop a solution, called AFL-EDGE, to improve the parallel mode of AFL, considering a round of mutations to a unique seed as a task and adopting edge coverage to define the uniqueness of a seed. We have implemented AFL-EDGE on top of AFL and evaluated the implementation with AFL on 9 widely used benchmark programs. It shows that AFL-EDGE can benefit the edge coverage of AFL. In a 24-hour test, the increase of edge coverage brought by AFL-EDGE to AFL ranges from 9.49% to 10.20%, depending on the number of instances. As a side benefit, we discovered 14 previously unknown bugs.
Submitted 17 September, 2021;
originally announced September 2021.
-
Requirements-Aided Automatic Test Case Generation for Industrial Cyber-physical Systems
Authors:
Roopak Sinha,
Cheng Pang,
Gerardo Santillán Martínez,
Juha Kuronen,
Valeriy Vyatkin
Abstract:
Industrial cyber-physical systems require complex distributed software to orchestrate many heterogeneous mechatronic components and control multiple physical processes. Industrial automation software is typically developed in a model-driven fashion where abstractions of physical processes called plant models are co-developed and iteratively refined along with the control code. Testing such multi-dimensional systems is extremely difficult because often models might not be accurate, do not correspond accurately with subsequent refinements, and the software must eventually be tested on the real plant, especially in safety-critical systems like nuclear plants. This paper proposes a framework wherein high-level functional requirements are used to automatically generate test cases for designs at all abstraction levels in the model-driven engineering process. Requirements are initially specified in natural language and then analyzed and specified using a formalized ontology. The requirements ontology is then refined along with controller and plant models during design and development stages such that test cases can be generated automatically at any stage. A representative industrial water process system case study illustrates the strengths of the proposed formalism. The requirements meta-model proposed by the CESAR European project is used for requirements engineering while IEC 61131-3 and model-driven concepts are used in the design and development phases. A tool resulting from the proposed framework called REBATE (Requirements Based Automatic Testing Engine) is used to generate and execute test cases for increasingly concrete controller and plant models.
Submitted 16 August, 2021;
originally announced August 2021.
-
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Authors:
Yu Sun,
Shuohuan Wang,
Shikun Feng,
Siyu Ding,
Chao Pang,
Junyuan Shang,
Jiaxiang Liu,
Xuyi Chen,
Yanbin Zhao,
Yuxiang Lu,
Weixin Liu,
Zhihua Wu,
Weibao Gong,
Jianzhong Liang,
Zhizhou Shang,
Peng Sun,
Wei Liu,
Xuan Ouyang,
Dianhai Yu,
Hao Tian,
Hua Wu,
Haifeng Wang
Abstract:
Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown that scaling up pre-trained language models can improve their generalization abilities. Particularly, the GPT-3 model with 175 billion parameters shows its strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. In addition, most large-scale models are trained in an auto-regressive way. As a result, this kind of traditional fine-tuning approach demonstrates relatively weak performance when solving downstream language understanding tasks. In order to solve the above problems, we propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models. It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks with zero-shot learning, few-shot learning or fine-tuning. We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph. Empirical results show that the model outperforms the state-of-the-art models on 54 Chinese NLP tasks, and its English version achieves the first place on the SuperGLUE benchmark (July 3, 2021), surpassing the human performance by +0.8% (90.6% vs. 89.8%).
Submitted 5 July, 2021;
originally announced July 2021.
-
Proving LTL Properties of Bitvector Programs and Decompiled Binaries (Extended)
Authors:
Yuandong Cyrus Liu,
Chengbin Pang,
Daniel Dietsch,
Eric Koskinen,
Ton-Chanh Le,
Georgios Portokalidis,
Jun Xu
Abstract:
There is increasing interest in applying verification tools to programs that have bitvector operations (eg., binaries). SMT solvers, which serve as a foundation for these tools, have thus increased support for bitvector reasoning through bit-blasting and linear arithmetic approximations. In this paper we show that similar linear arithmetic approximation of bitvector operations can be done at the source level through transformations. Specifically, we introduce new paths that over-approximate bitvector operations with linear conditions/constraints, increasing branching but allowing us to better exploit the well-developed integer reasoning and interpolation of verification tools. We show that, for reachability of bitvector programs, increased branching incurs negligible overhead yet, when combined with integer interpolation optimizations, enables more programs to be verified. We further show this exploitation of integer interpolation in the common case also enables competitive termination verification of bitvector programs and leads to the first effective technique for LTL verification of bitvector programs. Finally, we provide an in-depth case study of decompiled ("lifted") binary programs, which emulate X86 execution through frequent use of bitvector operations. We present a new tool DarkSea, the first tool capable of verifying reachability, termination, and LTL of lifted binaries.
Submitted 28 August, 2021; v1 submitted 11 May, 2021;
originally announced May 2021.
-
Viko: An Adaptive Gecko Gripper with Vision-based Tactile Sensor
Authors:
Chohei Pang,
Kinwing Mak,
Yazhan Zhang,
Yang Yang,
Yu Alexander Tse,
Michael Yu Wang
Abstract:
Monitoring the state of contact is essential for robotic devices, especially grippers that implement gecko-inspired adhesives where intimate contact is crucial for a firm attachment. However, due to the lack of deformable sensors, few have demonstrated tactile sensing for gecko grippers. We present Viko, an adaptive gecko gripper that utilizes vision-based tactile sensors to monitor contact state. The sensor provides high-resolution real-time measurements of contact area and shear force. Moreover, the sensor is adaptive, low-cost, and compact. We integrated gecko-inspired adhesives into the sensor surface without impeding its adaptiveness and performance. Using a robotic arm, we evaluate the performance of the gripper by a series of grasping test. The gripper has a maximum payload of 8N even at a low fingertip pitch angle of 30 degrees. We also showcase the gripper's ability to adjust fingertip pose for better contact using sensor feedback. Further, everyday object picking is presented as a demonstration of the gripper's adaptiveness.
Submitted 3 May, 2021;
originally announced May 2021.
-
Towards Optimal Use of Exception Handling Information for Function Detection
Authors:
Chengbin Pang,
Ruotong Yu,
Dongpeng Xu,
Eric Koskinen,
Georgios Portokalidis,
Jun Xu
Abstract:
Function entry detection is critical for security of binary code. Conventional methods heavily rely on patterns, inevitably missing true functions and introducing errors. Recently, call frames have been used in exception-handling for function start detection. However, existing methods have two problems. First, they combine call frames with heuristic-based approaches, which often brings error and uncertain benefits. Second, they trust the fidelity of call frames, without handling the errors that are introduced by call frames. In this paper, we first study the coverage and accuracy of existing approaches in detecting function starts using call frames. We found that recursive disassembly with call frames can maximize coverage, and using extra heuristic-based approaches does not improve coverage and actually hurts accuracy. Second, we unveil call-frame errors and develop the first approach to fix them, making their use more reliable.
Submitted 7 April, 2021;
originally announced April 2021.
-
Multi-source Transfer Learning with Ensemble for Financial Time Series Forecasting
Authors:
Qi-Qiao He,
Patrick Cheong-Iao Pang,
Yain-Whar Si
Abstract:
Although transfer learning is proven to be effective in computer vision and natural language processing applications, it is rarely investigated in forecasting financial time series. Majority of existing works on transfer learning are based on single-source transfer learning due to the availability of open-access large-scale datasets. However, in financial domain, the lengths of individual time series are relatively short and single-source transfer learning models are less effective. Therefore, in this paper, we investigate multi-source deep transfer learning for financial time series. We propose two multi-source transfer learning methods namely Weighted Average Ensemble for Transfer Learning (WAETL) and Tree-structured Parzen Estimator Ensemble Selection (TPEES). The effectiveness of our approach is evaluated on financial time series extracted from stock markets. Experiment results reveal that TPEES outperforms other baseline methods on majority of multi-source transfer tasks.
Submitted 25 March, 2021;
originally announced March 2021.
-
ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora
Authors:
Xuan Ouyang,
Shuohuan Wang,
Chao Pang,
Yu Sun,
Hao Tian,
Hua Wu,
Haifeng Wang
Abstract:
Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.
Submitted 17 September, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Capacity-achieving Polar-based LDGM Codes
Authors:
James Chin-Jen Pang,
Hessam Mahdavifar,
S. Sandeep Pradhan
Abstract:
In this paper, we study codes with sparse generator matrices. More specifically, low-density generator matrix (LDGM) codes with a certain constraint on the weight of the columns in the generator matrix are considered. It is first shown that when a BMS channel $W$ and a constant $s > 0$ are given, there exists a polarization kernel such that the corresponding polar code is capacity-achieving and the column weights of the generator matrix (GM) are bounded from above by $N^s$. Then, a general construction based on a concatenation of polar codes and a rate-$1$ code, and a new column-splitting algorithm that guarantees a much sparser GM, is given. More specifically, for any BMS channel and any $ε > 2ε^*$, where $ε^* \approx 0.085$, the existence of a sequence of capacity-achieving codes with all the GM column weights upper bounded by $(\log N)^{1+ε}$ is shown. Furthermore, two coding schemes for BEC and BMS channels, based on a second column-splitting algorithm, are devised with low-complexity decoding that uses successive cancellation. The second splitting algorithm allows for the use of a low-complexity decoder by preserving the reliability of the bit-channels observed by the source bits, and by increasing the code block length. The concatenation-based construction can also be applied to the random linear code ensemble to yield capacity-achieving codes with all the GM column weights being $O(\log N)$ and with (large-degree) polynomial decoding complexity.
Submitted 27 June, 2022; v1 submitted 27 December, 2020;
originally announced December 2020.
-
Origami-based Shape Morphing Fingertip to Enhance Grasping Stability and Dexterity
Authors:
Zicheng Kan,
Yazhan Zhang,
Chohei Pang,
Michael Yu Wang
Abstract:
Adaptation to various scene configurations and object properties, stability and dexterity in robotic grasping manipulation is far from explored. This work presents an origami-based shape morphing fingertip design to actively tackle the grasping stability and dexterity problems. The proposed fingertip utilizes origami as its skeleton, providing degrees of freedom at desired positions, and motor-driven four-bar linkages as its transmission components to achieve a compact fingertip size. Three morphing types that are commonly observed and essential in robotic grasping are studied and validated with geometrical modeling. Experiments including grasping an object with convex point contact to pivot or do pinch grasping, grasped object reorientation, and enveloping grasping with concave fingertip surfaces are implemented to demonstrate the advantages of our fingertip compared to conventional parallel grippers. Multi-functionality in enhancing grasping stability and dexterity via active adaptation given different grasped objects and manipulation tasks is justified. Video is available at youtu.be/jJoJ3xnDdVk/.
Submitted 10 October, 2020;
originally announced October 2020.
-
SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask
Authors:
Chengbin Pang,
Ruotong Yu,
Yaohui Chen,
Eric Koskinen,
Georgios Portokalidis,
Bing Mao,
Jun Xu
Abstract:
Disassembly of binary code is hard, but necessary for improving the security of binary software. Over the past few decades, research in binary disassembly has produced many tools and frameworks, which have been made available to researchers and security professionals. These tools employ a variety of strategies that grant them different characteristics. The lack of systematization, however, impedes new research in the area and makes selecting the right tool hard, as we do not understand the strengths and weaknesses of existing tools. In this paper, we systematize binary disassembly through the study of nine popular, open-source tools. We couple the manual examination of their code bases with the most comprehensive experimental evaluation (thus far) using 3,788 binaries. Our study yields a comprehensive description and organization of strategies for disassembly, classifying them as either algorithms or heuristics. Meanwhile, we measure and report the impact of individual algorithms on the results of each tool. We find that while principled algorithms are used by all tools, they still heavily rely on heuristics to increase code coverage. Depending on the heuristics used, different coverage-vs-correctness trade-offs come into play, leading to tools with different strengths and weaknesses. We envision that these findings will help users pick the right tool and assist researchers in improving binary disassembly.
Submitted 28 July, 2020;
originally announced July 2020.