Deep Perspective Transformation Based Vehicle Localization on Bird's Eye View

Authors: Abtin Mahyar, Hossein Motamednia, Dara Rahmati

Abstract: An accurate understanding of a self-driving vehicle's surrounding environment is crucial for its navigation system. To enhance the effectiveness of existing algorithms and facilitate further research, it is essential to provide comprehensive data to the routing system. Traditional approaches rely on installing multiple sensors to simulate the environment, leading to high costs and complexity. In this paper, we propose an alternative solution by generating a top-down representation of the scene, enabling the extraction of distances and directions of other cars relative to the ego vehicle. We introduce a new synthesized dataset that offers extensive information about the ego vehicle and its environment in each frame, providing valuable resources for similar downstream tasks. Additionally, we present an architecture that transforms perspective view RGB images into bird's-eye-view maps with segmented surrounding vehicles. This approach offers an efficient and cost-effective method for capturing crucial environmental information for self-driving cars. Code and dataset are available at https://github.com/IPM-HPC/Perspective-BEV-Transformer.

Submitted 12 November, 2023; originally announced November 2023.

Comments: 7 pages, 2 figures
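As a hedged illustration of the final step the abstract above mentions (extracting distances and directions of other cars once a bird's-eye-view map exists), here is a minimal Python sketch. It is not the authors' architecture. It assumes a binary BEV vehicle mask, a known ego pixel `ego_rc`, and a hypothetical `meters_per_pixel` scale, and it simply post-processes the mask with 4-connected component labeling.

```python
import math
from collections import deque


def locate_vehicles(mask, ego_rc, meters_per_pixel):
    """Return (distance_m, bearing_deg) for each 4-connected blob of 1s in a BEV mask.

    Hypothetical conventions (not from the paper): row 0 is farthest ahead, the ego
    vehicle sits at pixel ego_rc = (row, col), and bearing is measured clockwise
    from straight ahead (positive = to the right).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    results = []
    for r0 in range(rows):
        for c0 in range(cols):
            if mask[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Breadth-first search collects every pixel of this vehicle blob.
            queue, pixels = deque([(r0, c0)]), []
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Blob centroid in pixel coordinates.
            cr = sum(p[0] for p in pixels) / len(pixels)
            cc = sum(p[1] for p in pixels) / len(pixels)
            forward = (ego_rc[0] - cr) * meters_per_pixel  # meters ahead of ego
            lateral = (cc - ego_rc[1]) * meters_per_pixel  # meters to the right
            results.append((math.hypot(forward, lateral),
                            math.degrees(math.atan2(lateral, forward))))
    return results
```

Using a metric top-down grid is what makes this step trivial: once pixels map linearly to meters, distance and direction reduce to a centroid and an `atan2`.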

arXiv:2207.05676

HyperDbg: Reinventing Hardware-Assisted Debugging (Extended Version)

Authors: Mohammad Sina Karvandi, MohammadHossein Gholamrezaei, Saleh Khalaj Monfared, Soroush Meghdadizanjani, Behrooz Abbassi, Ali Amini, Reza Mortazavi, Saeid Gorgin, Dara Rahmati, Michael Schwarz

Abstract: Software analysis, debugging, and reverse engineering have a crucial impact on today's software industry. Efficient and stealthy debuggers are especially relevant for malware analysis. However, existing debugging platforms fail to provide a transparent, effective, and high-performance low-level debugger due to their detectable fingerprints, complexity, and implementation restrictions. In this paper, we present HyperDbg, a new hypervisor-assisted debugger for high-performance and stealthy debugging of user and kernel applications. To accomplish this, HyperDbg relies on state-of-the-art hardware features available in today's CPUs, such as VT-x and extended page tables. In contrast to other widely used debuggers, we design HyperDbg around a custom hypervisor, making it independent of OS functionality or APIs. We propose hardware-based instruction-level emulation and OS-level API hooking via extended page tables to increase stealthiness. Our dynamic analysis of 10,853 malware samples shows that HyperDbg's stealthiness allows debugging, on average, 22% and 26% more samples than WinDbg and x64dbg, respectively. Moreover, in contrast to existing debuggers, HyperDbg is not detected by any of the 13 tested packers and protectors. We improve performance over other debuggers by deploying a VMX-compatible script engine, eliminating unnecessary context switches. Our experiment on three concrete debugging scenarios shows that, compared to WinDbg as the only kernel debugger, HyperDbg performs step-in, conditional breaks, and syscall recording 2.98x, 1319x, and 2018x faster, respectively. We finally show real-world applications, such as a 0-day analysis, structure reconstruction for reverse engineering, software performance analysis, and code-coverage analysis.

Submitted 2 September, 2022; v1 submitted 29 May, 2022; originally announced July 2022.

arXiv:2010.05197

doi 10.1109/ISCAS45731.2020.9181001

TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training

Authors: Reza Hojabr, Kamyar Givaki, Kossar Pourahmadi, Parsa Nooralinejad, Ahmad Khonsari, Dara Rahmati, M. Hassan Najafi

Abstract: Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs) to interact with the real-world environment. This interaction requires the ability to retrain DNNs, since environmental conditions change continuously over time. Stochastic Gradient Descent (SGD) is a widely used algorithm to train DNNs by iteratively optimizing the parameters over the training data. In this work, we first present a novel approach to add training ability to a baseline (inference-only) DNN accelerator by splitting the SGD algorithm into simple computational elements. Then, based on this heuristic approach, we propose TaxoNN, a light-weight accelerator for DNN training. TaxoNN can easily tune the DNN weights by reusing the hardware resources of the inference process through time-multiplexing and low-bitwidth units. Our experimental results show that TaxoNN delivers, on average, a 0.97% higher misclassification rate compared to a full-precision implementation. Moreover, TaxoNN provides 2.1x power saving and 1.65x area reduction over the state-of-the-art DNN training accelerator.

Submitted 11 October, 2020; originally announced October 2020.

Comments: Accepted to ISCAS 2020. 5 pages, 5 figures

Journal ref: 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1-5
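To make "splitting the SGD algorithm into simple computational elements" concrete, the following Python sketch performs one SGD step for a single linear neuron using only fixed-point multiply, add, and shift operations, i.e. the same multiply-accumulate primitives an inference datapath already has. It is an illustration under assumptions, not TaxoNN's datapath: the Q-format with `FRAC_BITS = 8` and the squared-error loss are hypothetical choices.

```python
FRAC_BITS = 8            # hypothetical Q-format: 8 fractional bits
SCALE = 1 << FRAC_BITS


def to_fixed(x):
    """Quantize a real number to the fixed-point grid."""
    return int(round(x * SCALE))


def fx_mul(a, b):
    """Fixed-point multiply: full integer product, then arithmetic shift back."""
    return (a * b) >> FRAC_BITS


def sgd_step_fixed(w, x, y, lr):
    """One SGD step for y_hat = w . x with loss 0.5 * (y_hat - y)^2.

    All arguments are fixed-point integers; only multiply, add, and shift are used.
    """
    # Forward pass: the multiply-accumulate an inference accelerator already performs.
    y_hat = 0
    for wi, xi in zip(w, x):
        y_hat += fx_mul(wi, xi)
    err = y_hat - y                  # dL/dy_hat
    scaled_err = fx_mul(lr, err)     # lr * dL/dy_hat, computed once per sample
    # Weight update w_i <- w_i - lr * err * x_i: one multiply and one subtract per weight.
    return [wi - fx_mul(scaled_err, xi) for wi, xi in zip(w, x)]
```

For example, with w = [0, 0], x = [1.0, 2.0], y = 1.0, and a learning rate of 0.5, the full-precision update gives [0.5, 1.0], and the fixed-point version returns the integers [128, 256], which are exactly those values in Q8.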

arXiv:2005.14156

Unlucky Explorer: A Complete non-Overlapping Map Exploration

Authors: Mohammad Sina Kiarostami, Saleh Khalaj Monfared, Mohammadreza Daneshvaramoli, Ali Oliayi, Negar Yousefian, Dara Rahmati, Saeid Gorgin

Abstract: Nowadays, the field of Artificial Intelligence in Computer Games (AI in Games) is becoming more alluring, since computer games challenge many aspects of AI with a wide range of problems, particularly general problems. One of these problems is Exploration, in which an unknown environment must be explored by one or several agents. In this work, we first introduce the Maze Dash puzzle as an exploration problem where the agent must find a Hamiltonian Path visiting all the cells. Then, we investigate suitable methods, focusing on Monte-Carlo Tree Search (MCTS) and SAT, to solve this puzzle quickly and accurately. An optimization has been applied to the proposed MCTS algorithm to obtain a promising result. Also, since the prefabricated test cases of this puzzle are not large enough to assess the proposed method, we propose and employ a technique to generate solvable test cases to evaluate the approaches. Finally, the MCTS-based method is assessed on the auto-generated test cases and compared with our implemented SAT approach, which is considered a strong rival. Our comparison indicates that the MCTS-based approach is a promising method that copes with small and medium test cases with faster run-time than SAT. However, for several reasons discussed, including the features of the problem, the tree search organization, and the approach of MCTS in the Simulation step, MCTS takes more time to execute in large scenarios. Consequently, we have identified the bottleneck of the MCTS-based method on large test cases, which could be improved in two real-world problems.

Submitted 28 May, 2020; originally announced May 2020.
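The problem statement in the abstract above (visit every open cell exactly once, i.e. find a Hamiltonian path on a grid) can be made concrete with a plain depth-first backtracking search. This Python sketch is neither the paper's MCTS method nor its SAT encoding; it is a baseline that defines the search space. It assumes simple one-cell moves between 4-neighbors and a hypothetical grid encoding ('.' open, '#' wall); the actual Maze Dash movement rules may differ.

```python
def hamiltonian_path(grid, start):
    """Find a path that visits every open cell exactly once using 4-neighbor moves.

    grid: list of strings, '.' = open cell, '#' = wall; start must be an open cell.
    Returns a list of (row, col) or None. Exponential time in the worst case.
    """
    rows, cols = len(grid), len(grid[0])
    open_cells = sum(row.count('.') for row in grid)
    path, visited = [start], {start}

    def extend():
        if len(path) == open_cells:          # every open cell covered
            return True
        r, c = path[-1]
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == '.' and nxt not in visited):
                path.append(nxt)
                visited.add(nxt)
                if extend():
                    return True
                path.pop()                   # backtrack: undo the move
                visited.remove(nxt)
        return False

    return list(path) if extend() else None
```

The exhaustive backtracking explains why smarter search (MCTS) or a constraint solver (SAT) is needed: on a 3x3 open grid starting at an edge cell, the search must exhaust every branch before concluding, by checkerboard parity, that no path exists.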

arXiv:2005.10333

A Way Around UMIP and Descriptor-Table Exiting via TSX-based Side-Channel

Authors: Mohammad Sina Karvandi, Saleh Khalaj Monfared, Mohammad Sina Kiarostami, Dara Rahmati, Saeid Gorgin

Abstract: Nowadays, in operating systems, numerous protection mechanisms prevent or limit user-mode applications from accessing the kernel's internal information. This is regularly carried out by software-based defenses such as Address Space Layout Randomization (ASLR) and Kernel ASLR (KASLR). They play pronounced roles when the security of sandboxed applications such as web browsers is considered. Armed with arbitrary write access to kernel memory, if these protections are bypassed, an adversary could find a suitable location to write to in order to achieve elevation of privilege or code execution in ring 0. In this paper, we introduce a reliable method based on Transactional Synchronization Extensions (TSX) side-channel leakage to reveal the addresses of the Global Descriptor Table (GDT) and Interrupt Descriptor Table (IDT). We show that by detecting these addresses, one could execute instructions to sidestep Intel's User-Mode Instruction Prevention (UMIP) and the hypervisor-based mitigation and, consequently, neutralize them. The introduced method succeeds even after the most recent patches for Meltdown and Spectre. Moreover, the implementation of the proposed approach on different platforms, including the latest releases of Microsoft Windows, Linux, and Mac OSX with the latest 9th-generation Intel processors, shows that the proposed mechanism is independent of the operating system implementation. We demonstrate that combining this method with the call-gate mechanism (available in modern processors) in a chain of events eventually leads to a system compromise despite the limitations of a super-secure sandboxed environment in the presence of Windows' proprietary Virtualization Based Security (VBS). Finally, we suggest a software-based mitigation that avoids these issues with an acceptable overhead.

Submitted 22 April, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:2001.00053

On the Resilience of Deep Learning for Reduced-voltage FPGAs

Authors: Kamyar Givaki, Behzad Salami, Reza Hojabr, S. M. Reza Tayaranian, Ahmad Khonsari, Dara Rahmati, Saeid Gorgin, Adrian Cristal, Osman S. Unsal

Abstract: Deep Neural Networks (DNNs) are inherently computation-intensive and also power-hungry. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) are a promising solution that can satisfy these requirements for both embedded and High-Performance Computing (HPC) systems. In FPGAs, as well as CPUs and GPUs, aggressive voltage scaling below the nominal level is an effective technique for minimizing power dissipation. Unfortunately, bit-flip faults start to appear as the voltage is scaled down closer to the transistor threshold due to timing issues, thus creating a resilience issue. This paper experimentally evaluates the resilience of the training phase of DNNs in the presence of voltage-underscaling-related faults of FPGAs, especially in on-chip memories. Toward this goal, we have experimentally evaluated the resilience of LeNet-5 and of a specially designed network for the CIFAR-10 dataset with two activation functions, Rectified Linear Unit (ReLU) and Hyperbolic Tangent (Tanh). We have found that modern FPGAs are robust enough at extremely low voltage levels and that low-voltage-related faults can be automatically masked within the training iterations, so there is no need for costly software- or hardware-oriented fault mitigation techniques like ECC. Approximately 10% more training iterations are needed to close the gap in accuracy. This observation results from the relatively low rate of undervolting faults, i.e., <0.1%, measured on real FPGA fabrics. We have also significantly increased the fault rate for the LeNet-5 network through randomly generated fault-injection campaigns and observed that the training accuracy starts to degrade. When the fault rate increases, the network with the Tanh activation function outperforms the one with ReLU in terms of accuracy; e.g., when the fault rate is 30%, the accuracy difference is 4.92%.

Submitted 26 December, 2019; originally announced January 2020.
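A software fault-injection campaign of the kind described above can be sketched in Python by flipping bits in the IEEE-754 float32 encoding of weights. This is a hedged illustration, not the authors' FPGA setup: it assumes the fault rate is a per-bit flip probability and that weights are stored as float32; the function names are hypothetical.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = LSB, 31 = sign) of a float32 encoding and return the new float."""
    (raw,) = struct.unpack('<I', struct.pack('<f', value))
    (out,) = struct.unpack('<f', struct.pack('<I', raw ^ (1 << bit)))
    return out


def inject_faults(weights, bit_fault_rate, seed=0):
    """Return a copy of weights where each of the 32 bits flips independently
    with probability bit_fault_rate (a seeded RNG makes campaigns repeatable).

    Note: if a flip produces a NaN, its payload may not survive the next
    float32 round trip; this sketch does not model that corner case.
    """
    rng = random.Random(seed)
    faulty = []
    for w in weights:
        for bit in range(32):
            if rng.random() < bit_fault_rate:
                w = flip_bit(w, bit)
        faulty.append(w)
    return faulty
```

The encoding explains why fault location matters: flipping the sign bit of 1.0 yields -1.0, flipping the lowest exponent bit halves it to 0.5, while a high exponent bit can turn a small weight into a huge value, which is one plausible reason bounded activations such as Tanh tolerate high fault rates better than ReLU.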

arXiv:1910.12062

Decentralized Cooperative Communication-less Multi-Agent Task Assignment with Monte-Carlo Tree Search

Authors: Mohammadreza Daneshvaramoli, Mohammad Sina Kiarostami, Saleh Khalaj Monfared, Helia Karisani, Hamed Khashehchi, Dara Rahmati, Saeid Gorgin, Amir Rahmati

Abstract: Cooperative task assignment is an important subject in multi-agent systems with a wide range of applications. These systems are usually designed with massive communication among the agents to minimize the error in pursuit of the general goal of the entire system. In this work, we propose a novel approach for Decentralized Cooperative Communication-less Multi-Agent Task Assignment (DCCMATA) employing Monte-Carlo Tree Search (MCTS). Here, each agent assigns the optimal task to itself on its own. We design the system to automatically maximize the success rate, achieving the collective goal effectively. In other words, the agents optimally compute each following step knowing only the current locations of the other agents, with no additional communication overhead. In contrast with previously proposed methods, which rely on a task assignment procedure for similar problems, we describe a method in which the agents move towards the collective goal. This may lead to scenarios where some agents do not necessarily move towards the closest goal. However, the total efficiency (makespan) and effectiveness (success ratio) in these cases are significantly improved. To evaluate our approach, we have tested the algorithm with a wide range of parameters (agents, size, goals). Our implementation completely solves (success rate = 100%) a 20x20 grid with 20 goals by 20 agents in 7.9 s runtime for each agent. Also, the proposed algorithm runs with a complexity of O(N^2 I^2 + I N^4), where I and N are the MCTS iterative index and the grid size, respectively.

Submitted 23 February, 2020; v1 submitted 26 October, 2019; originally announced October 2019.
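The selection step at the heart of most MCTS variants balances exploiting moves with high average reward against exploring rarely tried ones. The abstract above does not specify its selection rule, so the following Python sketch shows the generic UCB1 (UCT) formula as an assumption, not as DCCMATA's exact policy; `children` holds hypothetical (total_reward, visit_count) pairs for the candidate moves of one agent.

```python
import math


def ucb1_select(children, exploration=math.sqrt(2)):
    """Return the index of the child maximizing the UCB1 score used in UCT-style MCTS.

    children: list of (total_reward, visit_count) pairs; the parent's visit count
    is taken as the sum of child visits. Unvisited children are returned first so
    every action is tried at least once.
    """
    for i, (_, visits) in enumerate(children):
        if visits == 0:
            return i
    parent_visits = sum(v for _, v in children)

    def score(child):
        total, visits = child
        exploit = total / visits                                        # mean reward
        explore = exploration * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore

    return max(range(len(children)), key=lambda i: score(children[i]))
```

In a communication-less setting, each agent would run this selection inside its own search tree, with the other agents' observed positions folded into the simulated states rather than exchanged as messages.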

arXiv:1909.04750

Generating High Quality Random Numbers: A High Throughput Parallel Bitsliced Approach

Authors: Saleh Khalaj Monfared, Omid Hajihassani, Soroush Meghdadi Zanjani, Mohammadsina Kiarostami, Dara Rahmati, Saeid Gorgin

Abstract: In this work, by employing a bitsliced data representation as the building block of algorithms, we showcase the capability and scalability of our proposed method for a variety of PRNG methods in the category of block and stream ciphers. While demonstrating the suitability of stream ciphers for high-throughput PRNG, as an example, we implement and investigate a bitsliced MICKEY 2.0 PRNG by altering the paradigm of its internal functions and data structures. The LFSR-based (Linear Feedback Shift Register) nature of the PRNG in our implementation perfectly suits the GPU's many-core structure due to its register-oriented architecture, and it allows the bitslicing technique to further improve performance. In our SIMD-vectorized, fully parallel GPU implementation, each GPU thread generates 32 pseudo-random bits in each LFSR clock cycle. We then compare our implementation with some of the most significant PRNGs that display satisfactory performance in both throughput and randomness criteria. The proposed implementation successfully passes the NIST test for statistical randomness and bit-wise correlation criteria. To the best of the authors' knowledge, our method outperforms the current best implementations in the literature for computer-based PRNG and the optical solutions in terms of performance and performance per cost, while maintaining an acceptable measure of randomness. Our highest performance among all of the CPRNGs implemented with the proposed method is achieved by the MICKEY 2.0 algorithm, which shows a 1.9x improvement over NVIDIA's state-of-the-art proprietary high-performance PRNG library, cuRAND, achieving 1.6 Tb/s of throughput on the affordable NVIDIA GTX 980 Ti.

Submitted 20 October, 2019; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: 10 pages
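The claim that bitslicing lets one thread emit 32 pseudo-random bits per LFSR clock can be illustrated with a toy register. This Python sketch does not implement MICKEY 2.0 (which has two registers with irregular clocking). It assumes a plain 16-bit Fibonacci LFSR with taps at bits 0, 2, 3, 5 (polynomial x^16 + x^14 + x^13 + x^11 + 1) and transposes 32 independent instances so that word k holds bit k of every lane. One XOR over whole words then clocks all 32 lanes at once, and the shift becomes a re-indexing of words.

```python
LFSR_BITS = 16
TAPS = (0, 2, 3, 5)   # x^16 + x^14 + x^13 + x^11 + 1 in right-shift Fibonacci form
LANES = 32


def scalar_step(state):
    """Reference: clock one 16-bit LFSR; returns (output_bit, new_state)."""
    out = state & 1
    fb = 0
    for t in TAPS:
        fb ^= (state >> t) & 1
    return out, (state >> 1) | (fb << (LFSR_BITS - 1))


def to_bitsliced(seeds):
    """Transpose 32 lane states into 16 words: bit j of word k = bit k of lane j."""
    words = [0] * LFSR_BITS
    for lane, seed in enumerate(seeds):
        for k in range(LFSR_BITS):
            words[k] |= ((seed >> k) & 1) << lane
    return words


def bitsliced_step(words):
    """Clock all 32 lanes at once: returns (32-bit output word, new words)."""
    out = words[0]              # bit j = output bit of lane j
    fb = 0
    for t in TAPS:
        fb ^= words[t]          # 32 feedback XORs in one word-wide operation
    return out, words[1:] + [fb]
```

On a GPU, each `words[k]` would live in a 32-bit register, which is why the register-oriented architecture mentioned in the abstract fits this layout.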

Showing 1–8 of 8 results for author: Rahmati, D