-
WaDec: Decompile WebAssembly Using Large Language Model
Authors:
Xinyu She,
Yanjie Zhao,
Haoyu Wang
Abstract:
WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development, offering a compact binary format that allows high-performance applications to run at near-native speeds in web browsers. Despite its advantages, Wasm's binary nature presents significant challenges for developers and researchers, particularly regarding readability when debugging or analyzing web applications. Therefore, effective decompilation becomes crucial. Unfortunately, traditional decompilers often struggle with producing readable outputs. While some large language model (LLM)-based decompilers have shown good compatibility with general binary files, they still face specific challenges when dealing with Wasm.
In this paper, we introduce a novel approach, WaDec, which is the first use of a fine-tuned LLM to interpret and decompile Wasm binary code into a higher-level, more comprehensible source code representation. The LLM was meticulously fine-tuned using a specialized dataset of wat-c code snippets, employing self-supervised learning techniques. This enables WaDec to effectively decompile not only complete wat functions but also finer-grained wat code snippets. Our experiments demonstrate that WaDec markedly outperforms current state-of-the-art tools, offering substantial improvements across several metrics. It achieves a code inflation rate of only 3.34%, a dramatic 97% reduction compared to the state-of-the-art's 116.94%. Unlike baselines' output that cannot be directly compiled or executed, WaDec maintains a recompilability rate of 52.11%, a re-execution rate of 43.55%, and an output consistency of 27.15%. Additionally, it significantly exceeds state-of-the-art performance in AST edit distance by 185%, cyclomatic complexity by 8%, and cosine similarity by 41%, achieving an average code similarity above 50%.
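The 97% figure can be checked from the two inflation rates quoted in the abstract. This is a minimal sketch, assuming "reduction" means the relative decrease against the baseline's 116.94%:

```latex
\[
\frac{116.94 - 3.34}{116.94} \;=\; \frac{113.60}{116.94} \;\approx\; 0.971 \;\approx\; 97\%
\]
```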
Submitted 17 June, 2024;
originally announced June 2024.
-
WRTester: Differential Testing of WebAssembly Runtimes via Semantic-aware Binary Generation
Authors:
Shangtong Cao,
Ningyu He,
Xinyu She,
Yixuan Zhang,
Mu Zhang,
Haoyu Wang
Abstract:
The Wasm runtime is a fundamental component of the Wasm ecosystem, as it directly determines whether Wasm applications can be executed as expected. Bugs in Wasm runtimes are frequently reported, so the research community has made several attempts to design automated testing frameworks for detecting them. However, existing testing frameworks are limited by the quality of their test cases: they struggle to generate Wasm binaries that are both semantically rich and syntactically correct, so complicated bugs cannot be triggered. In this work, we present WRTester, a novel differential testing framework that generates complicated Wasm test cases by disassembling and assembling real-world Wasm binaries, which can trigger hidden inconsistencies among Wasm runtimes. To further pinpoint the root causes of unexpected behaviors, we design a runtime-agnostic root cause location method to accurately locate bugs. Extensive evaluation suggests that WRTester outperforms SOTA techniques in terms of both efficiency and effectiveness. We have uncovered 33 unique bugs in popular Wasm runtimes, among which 25 have been confirmed.
Submitted 16 December, 2023;
originally announced December 2023.
-
Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey
Authors:
Xinyu She,
Yue Liu,
Yanjie Zhao,
Yiling He,
Li Li,
Chakkrit Tantithamthavorn,
Zhan Qin,
Haoyu Wang
Abstract:
Modern language models (LMs) have been successfully employed in source code generation and understanding, leading to a significant increase in research focused on learning-based code intelligence, such as automated bug repair and test case generation. Despite their great potential, language models for code intelligence (LM4Code) are susceptible to pitfalls that hinder realistic performance and further impact their reliability and applicability in real-world deployment. Such challenges drive the need for a comprehensive understanding: not just identifying these issues but delving into their possible implications and existing solutions to build more reliable language models tailored to code intelligence. Based on a well-defined systematic research approach, we conducted an extensive literature review to uncover the pitfalls inherent in LM4Code. In total, 67 primary studies from top-tier venues were identified. After carefully examining these studies, we designed a taxonomy of pitfalls in LM4Code research and conducted a systematic study to summarize the issues, implications, current solutions, and challenges of different pitfalls for LM4Code systems. We developed a comprehensive classification scheme that dissects pitfalls across four crucial aspects: data collection and labeling, system design and learning, performance evaluation, and deployment and maintenance. Through this study, we aim to provide a roadmap for researchers and practitioners, facilitating their understanding and utilization of LM4Code in reliable and trustworthy ways.
Submitted 27 October, 2023;
originally announced October 2023.
-
Dynamic Resource Management in CDRT Systems through Adaptive NOMA
Authors:
Hongjiang Lei,
Mingxu Yang,
Ki-Hong Park,
Nasir Saeed,
Xusheng She,
Jianling Cao
Abstract:
This paper introduces a novel adaptive transmission scheme to amplify the prowess of coordinated direct and relay transmission (CDRT) systems rooted in non-orthogonal multiple access principles. Leveraging the maximum ratio transmission scheme, we seamlessly meet the prerequisites of CDRT while harnessing the potential of dynamic power allocation and directional antennas to elevate the system's operational efficiency. Through meticulous derivations, we unveil closed-form expressions depicting the exact effective sum throughput. Our simulation results adeptly validate the theoretical analysis and vividly showcase the effectiveness of the proposed scheme.
Submitted 22 October, 2023;
originally announced October 2023.
-
Learning Point Processes using Recurrent Graph Network
Authors:
Saurabh Dash,
Xueyuan She,
Saibal Mukhopadhyay
Abstract:
We present a novel Recurrent Graph Network (RGN) approach for predicting discrete marked event sequences by learning the underlying complex stochastic process. Using the framework of Point Processes, we interpret a marked discrete event sequence as the superposition of different sequences, each of a unique type. The nodes of the Graph Network use LSTM to incorporate past information, whereas a Graph Attention Network (GAT Network) introduces strong inductive biases to capture the interaction between these different types of events. By changing the self-attention mechanism from attending over past events to attending over event types, we obtain a reduction in time and space complexity from $\mathcal{O}(N^2)$ (total number of events) to $\mathcal{O}(|\mathcal{Y}|^2)$ (number of event types). Experiments show that the proposed approach improves performance in log-likelihood, prediction and goodness-of-fit tasks with lower time and space complexity compared to state-of-the-art Transformer-based architectures.
Submitted 11 August, 2022;
originally announced August 2022.
-
On Secure NOMA-CDRT Systems with Physical Layer Network Coding
Authors:
Hongjiang Lei,
Xusheng She,
Ki-Hong Park,
Imran Shafique Ansari,
Zheng Shi,
Jing Jiang,
Mohamed-Slim Alouini
Abstract:
This paper proposes a new scheme to enhance the secrecy performance of a NOMA-based coordinated direct relay transmission system (NOMA-CDRT) with an untrusted relay. The physical-layer network coding and the non-orthogonal multiple access scheme are combined to improve the spectrum efficiency. Furthermore, inter-user interference and friendly jamming signals are utilized to suppress the eavesdropping ability of the untrusted relay without affecting the acceptance quality of legitimate users. Specifically, the far user in the first slot and the near user in the second slot act as jammers to generate jamming signals to ensure secure transmissions of the confidential signals. We investigate the secrecy performance of the proposed scheme in NOMA-CDRT systems and derive the closed-form expression for the ergodic secrecy sum rate. The asymptotic analysis at high signal-to-noise ratio is performed to obtain more insights. Finally, simulation results are presented to demonstrate the effectiveness of the proposed scheme and the correctness of the theoretical analysis.
Submitted 25 May, 2022;
originally announced May 2022.
-
A Fully Spiking Hybrid Neural Network for Energy-Efficient Object Detection
Authors:
Biswadeep Chakraborty,
Xueyuan She,
Saibal Mukhopadhyay
Abstract:
This paper proposes a Fully Spiking Hybrid Neural Network (FSHNN) for energy-efficient and robust object detection on resource-constrained platforms. The network architecture is based on a Convolutional SNN using leaky-integrate-fire neuron models. The model combines unsupervised Spike Time-Dependent Plasticity (STDP) learning with back-propagation (STBP) learning methods and also uses Monte Carlo Dropout to estimate the uncertainty error. FSHNN provides better accuracy than DNN-based object detectors while being 150X more energy-efficient. It also outperforms these object detectors, with a lower uncertainty error, when subjected to noisy input data and less labeled training data.
Submitted 23 July, 2021; v1 submitted 21 April, 2021;
originally announced April 2021.
-
Reversible data hiding in encrypted images based on pixel prediction and multi-MSB planes rearrangement
Authors:
Zhaoxia Yin,
Xiaomeng She,
Jin Tang,
Bin Luo
Abstract:
Great concern has arisen in the field of reversible data hiding in encrypted images (RDHEI) due to the development of cloud storage and privacy protection. RDHEI is an effective technology that can embed additional data after image encryption, extract additional data error-free and reconstruct original images losslessly. In this paper, a high-capacity and fully reversible RDHEI method is proposed, which is based on pixel prediction and multi-MSB (most significant bit) planes rearrangement. First, the median edge detector (MED) predictor is used to calculate the predicted value. Next, unlike previous methods, in our proposed method, signs of prediction errors (PEs) are represented by one bit plane and absolute values of PEs are represented by other bit planes. Then, we divide bit planes into uniform blocks and non-uniform blocks, and rearrange these blocks. Finally, according to different pixel prediction schemes, different numbers of additional data are embedded adaptively. The experimental results prove that our method has higher embedding capacity compared with state-of-the-art RDHEI methods.
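The MED predictor named in the abstract can be sketched as follows. This is a minimal illustration of the standard median edge detector rule (as used in JPEG-LS), not the authors' implementation; the function names and the sign-bit convention (1 for a negative error) are assumptions made for the example.

```python
def med_predict(left: int, above: int, upper_left: int) -> int:
    """Median edge detector: predict a pixel from three causal neighbours."""
    if upper_left >= max(left, above):
        return min(left, above)
    if upper_left <= min(left, above):
        return max(left, above)
    return left + above - upper_left


def prediction_error(pixel: int, left: int, above: int, upper_left: int) -> tuple[int, int]:
    """Split the prediction error into (sign_bit, magnitude).

    The sign-bit convention (1 = negative error) is hypothetical; the
    abstract only says the sign and absolute value go to separate bit planes.
    """
    error = pixel - med_predict(left, above, upper_left)
    return (1 if error < 0 else 0, abs(error))
```

For a pixel of 113 with neighbours left=100, above=120, upper_left=110, the prediction is 100 + 120 - 110 = 110, so the error splits into sign bit 0 and magnitude 3.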
Submitted 20 March, 2021; v1 submitted 8 July, 2020;
originally announced July 2020.
-
Improving Robustness of ReRAM-based Spiking Neural Network Accelerator with Stochastic Spike-timing-dependent-plasticity
Authors:
Xueyuan She,
Yun Long,
Saibal Mukhopadhyay
Abstract:
Spike-timing-dependent-plasticity (STDP) is an unsupervised learning algorithm for spiking neural networks (SNNs), which promise a deeper understanding of the human brain and more powerful artificial intelligence. While conventional computing systems fail to simulate SNNs efficiently, process-in-memory (PIM) based on devices such as ReRAM can be used to design fast and efficient STDP-based SNN accelerators, as it operates in close resemblance to biological neural networks. However, real-life implementations of such designs still suffer from the impact of input noise and device variation. In this work, we present a novel stochastic STDP algorithm that uses spiking frequency information to dynamically adjust synaptic behavior. The algorithm is tested on a pattern recognition task with noisy input and shows an accuracy improvement over deterministic STDP. In addition, we show that the new algorithm can be used to design a robust ReRAM-based SNN accelerator that has strong resilience to device variation.
Submitted 11 September, 2019;
originally announced September 2019.
-
ScieNet: Deep Learning with Spike-assisted Contextual Information Extraction
Authors:
Xueyuan She,
Yun Long,
Daehyun Kim,
Saibal Mukhopadhyay
Abstract:
Deep neural networks (DNNs) provide high image classification accuracy, but experience significant performance degradation when perturbations from various sources are present in the input. The lack of resilience to input perturbations makes DNNs less reliable for systems interacting with the physical world, such as autonomous vehicles and robotics, where imperfect input is the normal condition. We present a hybrid deep network architecture with spike-assisted contextual information extraction (ScieNet). ScieNet integrates unsupervised learning using a spiking neural network (SNN) for contextual information extraction with a back-end DNN trained for classification. The integrated network demonstrates high resilience to input perturbations without relying on prior training on perturbed inputs. We demonstrate ScieNet with different back-end DNNs for image classification using the CIFAR dataset, considering stochastic (noise) and structured (rain) input perturbations. Experimental results demonstrate significant improvement in accuracy on noisy and rainy images without prior training, while maintaining state-of-the-art accuracy on clean images.
Submitted 11 September, 2019;
originally announced September 2019.
-
HybridNet: Integrating Model-based and Data-driven Learning to Predict Evolution of Dynamical Systems
Authors:
Yun Long,
Xueyuan She,
Saibal Mukhopadhyay
Abstract:
Robotic systems continuously interact with complex dynamical systems in the physical world. Reliable predictions of the spatiotemporal evolution of these dynamical systems, with limited knowledge of system dynamics, are crucial for autonomous operation. In this paper, we present HybridNet, a framework that integrates data-driven deep learning and model-driven computation to reliably predict the spatiotemporal evolution of dynamical systems even with inexact knowledge of their parameters. A data-driven deep neural network (DNN) with Convolutional LSTM (ConvLSTM) as the backbone is employed to predict the time-varying evolution of the external forces/perturbations. The model-driven computation, in turn, is performed using a Cellular Neural Network (CeNN), a neuro-inspired algorithm to model dynamical systems defined by coupled partial differential equations (PDEs). CeNN converts the intricate numerical computation into a series of convolution operations, enabling a trainable PDE solver. With a feedback control loop, HybridNet can learn the physical parameters governing the system's dynamics in real time, and accordingly adapt the computation models to enhance prediction accuracy for time-evolving dynamical systems. The experimental results on two dynamical systems, namely a heat convection-diffusion system and a fluid dynamical system, demonstrate that HybridNet produces higher accuracy than the state-of-the-art deep-learning-based approach.
Submitted 5 January, 2019; v1 submitted 19 June, 2018;
originally announced June 2018.
-
Overload Control for Scaling WeChat Microservices
Authors:
Hao Zhou,
Ming Chen,
Qian Lin,
Yong Wang,
Xiaobin She,
Sifan Liu,
Rui Gu,
Beng Chin Ooi,
Junfeng Yang
Abstract:
Effective overload control for large-scale online service systems is crucial for protecting the system backend from overload. Conventionally, the design of overload control is ad hoc for each individual service. However, service-specific overload control could be detrimental to the overall system due to intricate service dependencies or flawed service implementations. Service developers usually have difficulty accurately estimating the dynamics of the actual workload during service development. Therefore, it is essential to decouple the overload control from service logic. In this paper, we propose DAGOR, an overload control scheme designed for the account-oriented microservice architecture. DAGOR is service agnostic and system-centric. It manages overload at the microservice granule such that each microservice monitors its load status in real time and triggers load shedding in a collaborative manner among its relevant services when overload is detected. DAGOR has been used in the WeChat backend for five years. Experimental results show that DAGOR can sustain a high service success rate even when the system is experiencing overload, while ensuring fairness in the overload control.
Submitted 23 December, 2018; v1 submitted 11 June, 2018;
originally announced June 2018.