Search | arXiv e-print repository

Reactor Optimization Benchmark by Reinforcement Learning

Authors: Deborah Schwarcz, Nadav Schneider, Gal Oren, Uri Steinitz

Abstract: Neutronic calculations for reactors are a daunting task when using Monte Carlo (MC) methods. As high-performance computing has advanced, the simulation of a reactor is nowadays more readily done, but design and optimization with multiple parameters is still a computational challenge. MC transport simulations, coupled with machine learning techniques, offer promising avenues for enhancing the efficiency and effectiveness of nuclear reactor optimization. This paper introduces a novel benchmark problem within the OpenNeoMC framework designed specifically for reinforcement learning. The benchmark involves optimizing a unit cell of a research reactor with two varying parameters (fuel density and water spacing) to maximize neutron flux while maintaining reactor criticality. The test case features distinct local optima, representing different physical regimes, thus posing a challenge for learning algorithms. Through extensive simulations utilizing evolutionary and neuroevolutionary algorithms, we demonstrate the effectiveness of reinforcement learning in navigating complex optimization landscapes with strict constraints. Furthermore, we propose acceleration techniques within the OpenNeoMC framework, including model updating and cross-section usage by RAM utilization, to expedite simulation times. Our findings emphasize the importance of machine learning integration in reactor optimization and contribute to advancing methodologies for addressing intricate optimization challenges in nuclear engineering.
The sources of this work are available at our GitHub repository: https://github.com/Scientific-Computing-Lab-NRCN/RLOpenNeoMC

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.02735

Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators

Authors: Yehonatan Fridman, Guy Tamir, Uri Steinitz, Gal Oren

Abstract: Monte Carlo (MC) simulations play a pivotal role in diverse scientific and engineering domains, with applications ranging from nuclear physics to materials science. Harnessing the computational power of high-performance computing (HPC) systems, especially Graphics Processing Units (GPUs), has become essential for accelerating MC simulations. This paper focuses on the adaptation and optimization of the OpenMC neutron and photon transport Monte Carlo code for Intel GPUs, specifically the Intel Data Center Max 1100 GPU (codename Ponte Vecchio, PVC), through distributed OpenMP offloading. Building upon prior work by Tramm J.R., et al. (2022), which laid the groundwork for GPU adaptation, our study meticulously extends the OpenMC code's capabilities to Intel GPUs. We present a comprehensive benchmarking and scaling analysis, comparing performance on Intel MAX GPUs to state-of-the-art CPU execution (Intel Xeon Platinum 8480+ Processor, codename 4th generation Sapphire Rapids). The results demonstrate a remarkable acceleration factor compared to CPU execution, showcasing the GPU-adapted code's superiority over its CPU counterpart as computational load increases.

Submitted 12 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: Further improvements are needed to gain a significant value for this paper. Unfortunately, we would not be able to submit a replacement in the following months

arXiv:2402.09126 [pdf, other]

MPIrigen: MPI Code Generation through Domain-Specific Language Models

Authors: Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren

Abstract: The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. Findings reveal that widely used models such as GPT-3.5 and PolyCoder (specialized multi-lingual code models) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which are pretrained on MPI-related programming languages of C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing for completion only after observing the whole code, thus enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen excels in generating accurate MPI functions up to 0.8 accuracy in location and function predictions, and with more than 0.9 accuracy for argument predictions.
The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen

Submitted 23 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.02018 [pdf, other]

The Landscape and Challenges of HPC Research and LLMs

Authors: Le Chen, Nesreen K. Ahmed, Akash Dutta, Arijit Bhattacharjee, Sixing Yu, Quazi Ishtiaque Mahmud, Waqwoya Abebe, Hung Phan, Aishwarya Sarkar, Branden Butler, Niranjan Hasabnis, Gal Oren, Vy A. Vo, Juan Pablo Munoz, Theodore L. Willke, Tim Mattson, Ali Jannesari

Abstract: Recently, language models (LMs), especially large language models (LLMs), have revolutionized the field of deep learning. Both encoder-decoder models and prompt-based techniques have shown immense potential for natural language processing and code-based tasks. Over the past several years, many research labs and institutions have invested heavily in high-performance computing, approaching or breaching exascale performance levels. In this paper, we posit that adapting and utilizing such language model-based techniques for tasks in high-performance computing (HPC) would be very beneficial. This study presents our reasoning behind the aforementioned position and highlights how existing ideas can be improved and adapted for HPC tasks.

Submitted 6 February, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.16445 [pdf, other]

OMPGPT: A Generative Pre-trained Transformer Model for OpenMP

Authors: Le Chen, Arijit Bhattacharjee, Nesreen Ahmed, Niranjan Hasabnis, Gal Oren, Vy Vo, Ali Jannesari

Abstract: Large language models (LLMs) such as ChatGPT have significantly advanced the field of Natural Language Processing (NLP). This trend led to the development of code-based large language models such as StarCoder, WizardCoder, and CodeLlama, which are trained extensively on vast repositories of code and programming languages. While the generic abilities of these code LLMs are useful for many programmers in tasks like code generation, the area of high-performance computing (HPC) has a narrower set of requirements that make a smaller and more domain-specific model a smarter choice. This paper presents OMPGPT, a novel domain-specific model meticulously designed to harness the inherent strengths of language models for OpenMP pragma generation. Furthermore, we leverage prompt engineering techniques from the NLP domain to create Chain-of-OMP, an innovative strategy designed to enhance OMPGPT's effectiveness. Our extensive evaluations demonstrate that OMPGPT outperforms existing large language models specialized in OpenMP tasks and maintains a notably smaller size, aligning it more closely with the typical hardware constraints of HPC environments. We consider our contribution as a pivotal bridge, connecting the advantage of language models with the specific demands of HPC tasks.

Submitted 21 June, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

arXiv:2312.13322 [pdf, other]

Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks

Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

Abstract: With easier access to powerful compute resources, there is a growing trend in AI for software development to develop larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because these LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing - why do we need large LMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller LMs for specific domains - we call them domain-specific LMs. Specifically, we start off with HPC as a domain and build an HPC-specific LM, named MonoCoder, that is orders of magnitude smaller than existing LMs but delivers similar, if not better performance, on non-HPC and HPC tasks. Specifically, we pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against conventional multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, achieves similar results on normalized-perplexity tests and much better ones in CodeBLEU competence for high-performance and parallel code generations.
Furthermore, fine-tuning the base model for the specific task of parallel code generation (OpenMP parallel for pragmas) demonstrates outstanding results compared to GPT, especially when local misleading semantics are removed by our novel pre-processor Tokompiler, showcasing the ability of domain-specific models to assist in HPC-relevant tasks.

Submitted 20 December, 2023; originally announced December 2023.

arXiv:2311.06505 [pdf, other]

CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset

Authors: Le Chen, Arijit Bhattacharjee, Nesreen K. Ahmed, Niranjan Hasabnis, Gal Oren, Bin Lei, Ali Jannesari

Abstract: Large language models (LLMs) have become increasingly prominent in academia and industry due to their remarkable performance in diverse applications. As these models evolve with increasing parameters, they excel in tasks like sentiment analysis and machine translation. However, even models with billions of parameters face challenges in tasks demanding multi-step reasoning. Code generation and comprehension, especially in C and C++, emerge as significant challenges. While LLMs trained on code datasets demonstrate competence in many tasks, they struggle with rectifying non-compilable C and C++ code. Our investigation attributes this subpar performance to two primary factors: the quality of the training dataset and the inherent complexity of the problem which demands intricate reasoning. Existing "Chain of Thought" (CoT) prompting techniques aim to enhance multi-step reasoning. This approach, however, retains the limitations associated with the latent drawbacks of LLMs. In this work, we propose CompCodeVet, a compiler-guided CoT approach to produce compilable code from non-compilable ones. Diverging from the conventional approach of utilizing larger LLMs, we employ compilers as a teacher to establish a more robust zero-shot thought process. The evaluation of CompCodeVet on two open-source code datasets shows that CompCodeVet has the ability to improve the training dataset quality for LLMs.

Submitted 11 November, 2023; originally announced November 2023.

arXiv:2308.10714 [pdf, other]

CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach

Authors: Yehonatan Fridman, Suprasad Mutalik Desai, Navneet Singh, Thomas Willhalm, Gal Oren

Abstract: In the landscape of High-Performance Computing (HPC), the quest for efficient and scalable memory solutions remains paramount. The advent of Compute Express Link (CXL) introduces a promising avenue with its potential to function as a Persistent Memory (PMem) solution in the context of disaggregated HPC systems. This paper presents a comprehensive exploration of CXL memory's viability as a candidate for PMem, supported by physical experiments conducted on cutting-edge multi-NUMA nodes equipped with CXL-attached memory prototypes. Our study not only benchmarks the performance of CXL memory but also illustrates the seamless transition from traditional PMem programming models to CXL, reinforcing its practicality. To substantiate our claims, we establish a tangible CXL prototype using an FPGA card embodying CXL 1.1/2.0 compliant endpoint designs (Intel FPGA CXL IP). Performance evaluations, executed through the STREAM and STREAM-PMem benchmarks, showcase CXL memory's ability to mirror PMem characteristics in App-Direct and Memory Mode while achieving impressive bandwidth metrics with Intel 4th generation Xeon (Sapphire Rapids) processors. The results elucidate the feasibility of CXL memory as a persistent memory solution, outperforming previously established benchmarks. In contrast to published DCPMM results, our CXL-DDR4 memory module offers comparable bandwidth to local DDR4 memory configurations, albeit with a moderate decrease in performance.
The modified STREAM-PMem application underscores the ease of transitioning programming models from PMem to CXL, thus underscoring the practicality of adopting CXL memory.

Submitted 21 August, 2023; originally announced August 2023.

Comments: 12 pages, 9 figures

arXiv:2308.09440 [pdf, other]

Scope is all you need: Transforming LLMs for HPC Code

Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

Abstract: With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while avoiding human semantics attributed to code structures completely. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, for a Fortran code corpus mined from GitHub. We evaluate the performance of these models against the conventional LLMs. Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers in normalized-perplexity tests, down to ~1 perplexity score.
This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.

Submitted 29 September, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.08206 [pdf, other]

Explainable Multi-View Deep Networks Methodology for Experimental Physics

Authors: Nadav Schneider, Muriel Tzdaka, Galit Sturm, Guy Lazovski, Galit Bar, Gilad Oren, Raz Gvishi, Gal Oren

Abstract: Physical experiments often involve multiple imaging representations, such as X-ray scans and microscopic images. Deep learning models have been widely used for supervised analysis in these experiments. Combining different image representations is frequently required to analyze and make a decision properly. Consequently, multi-view data has emerged - datasets where each sample is described by views from different angles, sources, or modalities. These problems are addressed with the concept of multi-view learning. Understanding the decision-making process of deep learning models is essential for reliable and credible analysis. Hence, many explainability methods have been devised recently. Nonetheless, there is a lack of proper explainability in multi-view models, which are challenging to explain due to their architectures. In this paper, we suggest different multi-view architectures for the vision domain, each suited to another problem, and we also present a methodology for explaining these models. To demonstrate the effectiveness of our methodology, we focus on the domain of High Energy Density Physics (HEDP) experiments, where multiple imaging representations are used to assess the quality of foam samples. We apply our methodology to classify the quality of the foam samples using the suggested multi-view architectures. Through experimental results, we showcase the improvement of accurate architecture choice on both accuracy - 78% to 84% and AUC - 83% to 93% and present a trade-off between performance and explainability.
Specifically, we demonstrate that our approach enables the explanation of individual one-view models, providing insights into the decision-making process of each view. This understanding enhances the interpretability of the overall multi-view model. The sources of this work are available at: https://github.com/Scientific-Computing-Lab-NRCN/Multi-View-Explainability.

Submitted 17 August, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

arXiv:2308.08002 [pdf, ps, other]

Quantifying OpenMP: Statistical Insights into Usage and Adoption

Authors: Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

Abstract: In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is a lack of comprehensive data on the actual usage of OpenMP constructs, hindering unbiased insights into its popularity and evolution. This paper presents a statistical analysis of OpenMP usage and adoption trends based on a novel and extensive database, HPCORPUS, compiled from GitHub repositories containing C, C++, and Fortran code. The results reveal that OpenMP is the dominant parallel programming model, accounting for 45% of all analyzed parallel APIs. Furthermore, it has demonstrated steady and continuous growth in popularity over the past decade. Analyzing specific OpenMP constructs, the study provides in-depth insights into their usage patterns and preferences across the three languages. Notably, we found that while OpenMP has a strong "common core" of constructs in common usage (while the rest of the API is less used), there are new adoption trends as well, such as simd and target directives for accelerated computing and task for irregular parallelism. Overall, this study sheds light on OpenMP's significance in HPC applications and provides valuable data for researchers and practitioners.
It showcases OpenMP's versatility, evolving adoption, and relevance in contemporary parallel programming, underlining its continued role in HPC applications and beyond. These statistical insights are essential for making informed decisions about parallelization strategies and provide a foundation for further advancements in parallel programming models and techniques.

Submitted 17 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

arXiv:2305.11999 [pdf, other]

Advising OpenMP Parallelization via a Graph-Based Approach with Transformers

Authors: Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

Abstract: There is an ever-present need for shared memory parallelization schemes to exploit the full potential of multi-core architectures. The most common parallelization API addressing this need today is OpenMP. Nevertheless, writing parallel code manually is complex and effort-intensive. Thus, many deterministic source-to-source (S2S) compilers have emerged, intending to automate the process of translating serial to parallel code. However, recent studies have shown that these compilers are impractical in many scenarios. In this work, we combine the latest advancements in the field of AI and natural language processing (NLP) with the vast amount of open-source code to address the problem of automatic parallelization. Specifically, we propose a novel approach, called OMPify, to detect and predict the OpenMP pragmas and shared-memory attributes in parallel code, given its serial version. OMPify is based on a Transformer-based model that leverages a graph-based representation of source code that exploits the inherent structure of code. We evaluated our tool by predicting the parallelization pragmas and attributes of a large corpus of (over 54,000) snippets of serial code written in C and C++ languages (Open-OMP-Plus). Our results demonstrate that OMPify outperforms existing approaches, the general-purpose and popular ChatGPT and targeted PragFormer models, in terms of F1 score and accuracy. Specifically, OMPify achieves up to 90% accuracy on commonly-used OpenMP benchmark tests such as NAS, SPEC, and PolyBench.
Additionally, we performed an ablation study to assess the impact of different model components and present interesting insights derived from the study. Lastly, we also explored the potential of using data augmentation and curriculum learning techniques to improve the model's robustness and generalization capabilities.

Submitted 16 May, 2023; originally announced May 2023.

arXiv:2305.09438 [pdf, other]

MPI-rical: Data-Driven MPI Distributed Parallelism Assistance with Transformers

Authors: Nadav Schneider, Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

Abstract: Message Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes. However, parallelizing MPI code manually, and specifically, performing domain decomposition, is a challenging, error-prone task. In this paper, we address this problem by developing MPI-RICAL, a novel data-driven, programming-assistance tool that assists programmers in writing domain decomposition based distributed memory parallelization code. Specifically, we train a supervised language model to suggest MPI functions and their proper locations in the code on the fly. We also introduce MPICodeCorpus, the first publicly available corpus of MPI-based parallel programs that is created by mining more than 15,000 open-source repositories on GitHub. Experiments were conducted on MPICodeCorpus and, more importantly, on a compiled benchmark of MPI-based parallel programs for numerical computations that represent real-world scientific applications. MPI-RICAL achieves F1 scores between 0.87-0.91 on these programs, demonstrating its accuracy in suggesting correct MPI functions at appropriate code locations. The source code used in this work, as well as other relevant sources, are available at: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rical

Submitted 30 August, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

arXiv:2304.04276 [pdf, other]

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Authors: Yehonatan Fridman, Guy Tamir, Gal Oren

Abstract: Over the last decade, most of the increase in computing power has been gained by advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performances in various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, introduced offloading capabilities between host (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs -- the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 GPUs -- were released to the market, with the oneAPI and NVHPC compilers for offloading, correspondingly. In this work, we present early performance results of OpenMP offloading capabilities to these devices while specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware in a representative scientific mini-app (the LULESH benchmark). Our results show that the coverage for version 4.5 is nearly complete in both the latest NVHPC and oneAPI tools. However, we observed a lack of support in versions 5.0, 5.1, and 5.2, which is particularly noticeable when using NVHPC. From the performance perspective, we found that the PVC1100 and A100 are relatively comparable on the LULESH benchmark. While the A100 is slightly better due to faster memory bandwidth, the PVC1100 reaches the next problem size (400^3) scalably due to the larger memory size.

Submitted 14 May, 2023; v1 submitted 9 April, 2023; originally announced April 2023.

Comments: 13 pages, 5 Figures, 5 Tables

arXiv:2209.01983 [pdf, other]

ScalSALE: Scalable SALE Benchmark Framework for Supercomputers

Authors: Re'em Harel, Matan Rusanovsky, Ron Wagner, Harel Levin, Gal Oren

Abstract: Supercomputers worldwide provide the necessary infrastructure for groundbreaking research. However, most supercomputers are not designed equally due to different desired figures of merit, which are derived from the computational bounds of the targeted scientific applications' portfolio. In turn, the design of such computers becomes an optimization process that strives to achieve the best performance possible in a multi-parameter search space. Therefore, verifying and evaluating whether a supercomputer can achieve its desired goal becomes a tedious and complex task. For this purpose, many full, mini, proxy, and benchmark applications have been introduced in an attempt to represent scientific applications. Nevertheless, as these benchmarks are hard to expand, and most importantly, are over-simplified compared to scientific applications that tend to couple multiple scientific domains, they fail to represent the true scaling capabilities. We suggest a new physical scalable benchmark framework, namely ScalSALE, based on the well-known SALE scheme. ScalSALE's main goal is to provide a simple, flexible, scalable infrastructure that can be easily expanded to include multi-physical schemes while maintaining scalable and efficient execution times. By expanding ScalSALE, the gap between the over-simplified benchmarks and scientific applications can be bridged.
To achieve this goal, ScalSALE is implemented in Modern Fortran with simple OOP design patterns and supported by transparent MPI-3 blocking and non-blocking communication that allows such a scalable framework. ScalSALE is compared to LULESH via simulating the Sedov-Taylor blast wave problem using strong and weak scaling tests. ScalSALE is executed and evaluated with both rezoning options - Lagrangian and Eulerian.

Submitted 5 September, 2022; originally announced September 2022.

arXiv:2208.07196 [pdf, other]

Determining HEDP Foams' Quality with Multi-View Deep Learning Classification

Authors: Nadav Schneider, Matan Rusanovsky, Raz Gvishi, Gal Oren

Abstract: High energy density physics (HEDP) experiments commonly involve a dynamic wave-front propagating inside a low-density foam. This effect changes the foam's density and hence its transparency. A common problem in foam production is the creation of defective foams. Accurate information on their dimensions and homogeneity is required to classify the foams' quality. Therefore, those parameters are being characterized using a 3D-measuring laser confocal microscope. For each foam, five images are taken: two 2D images representing the top and bottom surface foam planes and three images of side cross-sections from 3D scans. An expert has to do the complicated, harsh, and exhausting work of manually classifying the foam's quality through the image set and only then determine whether the foam can be used in experiments or not. Currently, quality has two binary levels of normal vs. defective. At the same time, experts are commonly required to classify a sub-class of normal-defective, i.e., foams that are defective but might be sufficient for the needed experiment. This sub-class is problematic due to inconclusive judgment that is primarily intuitive. In this work, we present a novel state-of-the-art multi-view deep learning classification model that mimics the physicist's perspective by automatically determining the foams' quality classification and thus aids the expert. Our model achieved 86% accuracy on upper and lower surface foam planes and 82% on the entire set, suggesting interesting heuristics to the problem.
A significant added value in this work is the ability to regress the foam quality instead of binary deduction and even explain the decision visually. The source code used in this work, as well as other relevant sources, are available at: https://github.com/Scientific-Computing-Lab-NRCN/Multi-View-Foams.git

Submitted 10 August, 2022; originally announced August 2022.

arXiv:2208.02240 [pdf, other]

The Case for Non-Volatile RAM in Cloud HPCaaS

Authors: Yehonatan Fridman, Re'em Harel, Gal Oren

Abstract: HPC as a service (HPCaaS) is a new way to expose HPC resources via cloud services. However, continued effort to port large-scale tightly coupled applications with high interprocessor communication to multiple (and many) nodes synchronously, as in on-premise supercomputers, is still far from satisfactory due to network latencies. As a consequence, in said cases, HPCaaS is recommended to be used with one or few instances. In this paper we make the claim that a new piece of memory hardware, namely Non-Volatile RAM (NVRAM), can allow such computations to scale up to an order of magnitude with a marginal penalty in comparison to RAM. Moreover, we suggest that the introduction of NVRAM to HPCaaS can be cost-effective to the users and the suppliers in numerous forms.

Submitted 3 August, 2022; originally announced August 2022.

Comments: 4 pages

arXiv:2204.12835 [pdf, other]

Learning to Parallelize in a Shared-Memory Environment with Transformers

Authors: Re'em Harel, Yuval Pinter, Gal Oren

Abstract: In past years, the world has switched to many-core and multi-core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications. OpenMP is the most comprehensive API that implements such schemes, characterized by a readable interface. Nevertheless, introducing OpenMP into code is challenging due to pervasive pitfalls in management of parallel shared memory. To facilitate the performance of this task, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in ML techniques, specifically in natural language processing (NLP), to replace S2S compilers altogether. We create a database (corpus), Open-OMP, specifically for this goal. Open-OMP contains over 28,000 code snippets, half of which contain OpenMP directives while the other half do not need parallelization at all with high probability. We use the corpus to train systems to automatically classify code segments in need of parallelization, as well as suggest individual OpenMP clauses.
We train several transformer models, named PragFormer, for these tasks, and show that they outperform statistically-trained baselines and automatic S2S parallelization compilers in both classifying the overall need for an OpenMP directive and the introduction of private and reduction clauses. Our source code and database are available at: https://github.com/Scientific-Computing-Lab-NRCN/PragFormer.

Submitted 14 July, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

arXiv:2204.11584 [pdf, other]

Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM

Authors: Yehonatan Fridman, Yaniv Snir, Harel Levin, Danny Hendler, Hagit Attiya, Gal Oren

Abstract: HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes. Mathematical solvers, in particular iterative linear solvers, are key building blocks in numerous large-scale scientific applications. Consequently, supporting the recovery of distributed solvers is necessary for scaling scientific applications to exascale platforms. Previous recovery methods for iterative solvers are based on Checkpoint-Restart (CR), which incurs high fault tolerance overhead, or intrinsic fault tolerance, which requires extra computation time to converge after failures. Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-term computations. ESR has been shown to provide exact reconstruction of the computation state while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundancies in the RAM of multiple nodes, incurring high memory and network overheads. Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience.
Our mechanism, called in-NVRAM ESR, is based on a novel MPI One-Sided Communication (OSC) over RDMA implementation, and provides full resiliency while significantly reducing both the memory footprint and the time overhead in comparison with the original ESR design (in-RAM ESR).

Submitted 9 August, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

Comments: 10 pages, 10 figures

arXiv:2109.05746 [pdf, other]

ChangeChip: A Reference-Based Unsupervised Change Detection for PCB Defect Detection

Authors: Yehonatan Fridman, Matan Rusanovsky, Gal Oren

Abstract: The usage of electronic devices is increasing and becoming predominant in most aspects of life. Surface Mount Technology (SMT) is the most common industrial method for manufacturing electric devices in which electrical components are mounted directly onto the surface of a Printed Circuit Board (PCB). Although the expansion of electronic devices affects our lives in a productive way, failures or defects in the manufacturing procedure of those devices might also be counterproductive and even harmful in some cases. It is therefore desired and sometimes crucial to ensure zero-defect quality in electronic devices and their production. While traditional Image Processing (IP) techniques are not sufficient to produce a complete solution, other promising methods like Deep Learning (DL) might also be challenging for PCB inspection, mainly because such methods require large, adequate datasets which are missing, not available or not updated in the rapidly growing field of PCBs. Thus, PCB inspection is conventionally performed manually by human experts. Unsupervised Learning (UL) methods may potentially be suitable for PCB inspection, having learning capabilities on the one hand, while not relying on large datasets on the other. In this paper, we introduce ChangeChip, an automated and integrated change detection system for defect detection in PCBs, from soldering defects to missing or misaligned electronic elements, based on Computer Vision (CV) and UL.
We achieve good quality defect detection by applying an unsupervised change detection between images of a golden PCB (reference) and the inspected PCB under various settings. In this work, we also present CD-PCB, a synthesized labeled dataset of 20 pairs of PCB images for evaluation of defect detection algorithms.

Submitted 13 September, 2021; originally announced September 2021.

Comments: 8 pages, 5 figures. The sources of ChangeChip, as well as CD-PCB, are available at: https://github.com/Scientific-Computing-Lab-NRCN/ChangeChip

arXiv:2109.03122 [pdf, ps, other]

Lax Functors, Cospans, and the Center Construction

Authors: Ryan E. Grady, Garrett Oren

Abstract: The center construction is not (classically) functorial. In this note, we specialize a universal construction of Jacob Lurie to the category of rings and upgrade the classical center to a lax functor. In particular, we find lax functors to the Morita category and the category of cospans.

Submitted 19 November, 2021; v1 submitted 7 September, 2021; originally announced September 2021.

Comments: V3: Final, title change (to be more descriptive)

Journal ref: Grad. J. Math. 6 (2021), no. 2, 1-8

arXiv:2109.02166 [pdf, other]

Assessing the Use Cases of Persistent Memory in High-Performance Scientific Computing

Authors: Yehonatan Fridman, Yaniv Snir, Matan Rusanovsky, Kfir Zvi, Harel Levin, Danny Hendler, Hagit Attiya, Gal Oren

Abstract: As the High Performance Computing world moves towards the Exa-Scale era, huge amounts of data should be analyzed, manipulated and stored. In the traditional storage/memory hierarchy, each compute node retains its data objects in its local volatile DRAM. Whenever the DRAM's capacity becomes insufficient for storing this data, the computation should either be distributed between several compute nodes, or some portion of these data objects must be stored in a non-volatile block device such as a hard disk drive or an SSD storage device. Optane DataCenter Persistent Memory Module (DCPMM), a new technology introduced by Intel, provides non-volatile memory that can be plugged into standard memory bus slots and therefore be accessed much faster than standard storage devices. In this work, we present and analyze the results of a comprehensive performance assessment of several ways in which DCPMM can 1) replace standard storage devices, and 2) replace or augment DRAM for improving the performance of HPC scientific computations. To achieve this goal, we have configured an HPC system such that DCPMM can service I/O operations of scientific applications, replace standard storage devices and file systems (specifically for diagnostics and checkpoint-restarting), and serve for expanding applications' main memory. We focus on keeping the scientific codes with as few changes as possible, while allowing them to access the NVM transparently as if they access persistent storage.
Our results show that DCPMM allows scientific applications to fully utilize nodes' locality by providing them with sufficiently-large main memory. Moreover, it can be used for providing a high-performance replacement for persistent storage. Thus, the usage of DCPMM has the potential of replacing standard HDD and SSD storage devices in HPC architectures and enabling a more efficient platform for modern supercomputing applications.

Submitted 5 September, 2021; originally announced September 2021.

Comments: 10 pages, 6 figures, The source code used by this work, as well as the benchmarks and other relevant sources, are available at: https://github.com/Scientific-Computing-Lab-NRCN/StoringStorage

arXiv:2107.12304 [pdf, other]

In Defense of the Learning Without Forgetting for Task Incremental Learning

Authors: Guy Oren, Lior Wolf

Abstract: Catastrophic forgetting is one of the major challenges on the road for continual learning systems, which are presented with an on-line stream of tasks. The field has attracted considerable interest and a diverse set of methods have been presented for overcoming this challenge. Learning without Forgetting (LwF) is one of the earliest and most frequently cited methods. It has the advantages of not requiring the storage of samples from the previous tasks, of implementation simplicity, and of being well-grounded by relying on knowledge distillation. However, the prevailing view is that while it shows a relatively small amount of forgetting when only two tasks are introduced, it fails to scale to long sequences of tasks. This paper challenges this view, by showing that using the right architecture along with a standard set of augmentations, the results obtained by LwF surpass the latest algorithms for the task-incremental scenario. This improved performance is demonstrated by an extensive set of experiments over CIFAR-100 and Tiny-ImageNet, where it is also shown that other methods cannot benefit as much from similar improvements.

Submitted 26 July, 2021; originally announced July 2021.

Comments: 12 pages with 4 figures

arXiv:2104.11159 [pdf, other]

An End-to-End Computer Vision Methodology for Quantitative Metallography

Authors: Matan Rusanovsky, Ofer Beeri, Gal Oren

Abstract: Metallography is crucial for a proper assessment of a material's properties. It involves mainly the investigation of the spatial distribution of grains and the occurrence and characteristics of inclusions or precipitates. This work presents a holistic artificial intelligence model for Anomaly Detection that automatically quantifies the degree of anomaly of impurities in alloys. We suggest the following examination process: (1) Deep semantic segmentation is performed on the inclusions (based on a suitable metallographic database of alloys and corresponding tags of inclusions), producing inclusions masks that are saved into a separated database. (2) Deep image inpainting is performed to fill the removed inclusions parts, resulting in 'clean' metallographic images, which contain the background of grains. (3) Grains' boundaries are marked using deep semantic segmentation (based on another metallographic database of alloys), producing boundaries that are ready for further inspection on the distribution of grains' size. (4) Deep anomaly detection and pattern recognition is performed on the inclusions masks to determine spatial, shape and area anomaly detection of the inclusions. Finally, the system recommends areas of interest to an expert for further examination. The performance of the model is presented and analyzed based on a few representative cases. Although the models presented here were developed for metallography analysis, most of them can be generalized to a wider set of problems in which anomaly detection of geometrical objects is desired.
All models, as well as the data-sets that were created for this work, are publicly available at https://github.com/Scientific-Computing-Lab-NRCN/MLography.

Submitted 1 March, 2022; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2003.04226. Same text as last submission, changed the author list to correspond to the pdf

arXiv:2102.12953 [pdf, ps, other]

Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption

Authors: Kfir Zvi, Gal Oren

Abstract: Common resource management methods in supercomputing systems usually include hard divisions, capping, and quota allotment. Those methods, despite their 'advantages', have some known serious disadvantages including unoptimized utilization of an expensive facility, and occasionally there is still a need to dynamically reschedule and reallocate the resources. Consequently, those methods involve bad supply-and-demand management rather than a free market playground that will eventually increase system utilization and productivity. In this work, we propose the new Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption, in which the social welfare increases using a free-of-cost interchangeable proprietary possession scheme. Accordingly, we permanently keep the status-quo in regard to the fairness of the resources distribution while maximizing the ability of all users to achieve more CPUs and CPU hours for longer periods without any non-straightforward costs, penalties or additional human intervention.

Submitted 25 February, 2021; originally announced February 2021.

Comments: 4 pages

arXiv:2005.13304 [pdf, other]

ComPar: Optimized Multi-Compiler for Automatic OpenMP S2S Parallelization

Authors: Idan Mosseri, Lee-or Alon, Re'em Harel, Gal Oren

Abstract: Parallelization schemes are essential in order to exploit the full benefits of multi-core architectures. In said architectures, the most comprehensive parallelization API is OpenMP. However, the introduction of correct and optimal OpenMP parallelization to applications is not always a simple task, due to common parallel management pitfalls, architecture heterogeneity and the current necessity for human expertise in order to comprehend many fine details and abstract correlations. To ease this process, many automatic parallelization compilers were created over the last decade. Harel et al. [2020] tested several source-to-source compilers and concluded that each has its advantages and disadvantages and no compiler is superior to all other compilers in all tests. This indicates that a fusion of the compilers' best outputs under the best hyper-parameters for the current hardware setups can yield greater speedups. To create such a fusion, one should execute a computationally intensive hyper-parameter sweep, in which the performance of each option is estimated and the best option is chosen. We created a novel parallelization source-to-source multi-compiler named ComPar, which uses code segmentation-and-fusion with hyper-parameters tuning to achieve the best parallel code possible without any human intervention while maintaining the program's validity. In this paper we present ComPar and analyze its results on NAS and PolyBench benchmarks.
We conclude that although the resources ComPar requires to produce parallel code are greater than other source-to-source parallelization compilers - as it depends on the number of parameters the user wishes to consider, and their combinations - ComPar achieves superior performance overall compared to the serial code version and other tested parallelization compilers. ComPar is publicly available at: https://github.com/Scientific-Computing-Lab-NRCN/compar.

Submitted 27 May, 2020; originally announced May 2020.

Comments: 15 pages

arXiv:2005.04198 [pdf, ps, other]

Distributed K-Backup Placement and Applications to Virtual Memory in Real-World Wireless Networks

Authors: Gal Oren, Leonid Barenboim

Abstract: The Backup Placement problem in networks in the $\mathcal{CONGEST}$ distributed setting considers a network graph $G = (V,E)$, in which the goal of each vertex $v \in V$ is selecting a neighbor, such that the maximum number of vertices in $V$ that select the same vertex is minimized [Halldorsson et al., 2015]. Previous backup placement algorithms suffer from obliviousness to main factors of real-world heterogeneous wireless networks. Specifically, there is no consideration of the nodes' memory and storage capacities, and no reference to a case in which nodes have different energy capacities, and thus can leave (or join) the network at any time. These parameters are strongly correlated in wireless networks, as the load on different parts of the network can differ greatly, thus requiring more communication, energy, memory and storage. In order to fit the real-world attributes of wireless networks, this work addresses a generalized version of the original problem, namely $K$-Backup Placement, in which each vertex selects $K$ neighbors, for a positive parameter $K$. Our $K$-Backup Placement algorithm terminates within just one round. In addition, we suggest two complementary algorithms which employ $K$-Backup-Placement to obtain efficient virtual memory schemes for wireless networks. The first algorithm divides the memory of each node into many small parts. Each vertex is assigned the memories of a large subset of its neighbors. Thus more memory capacity for more vertices is gained, but with much fragmentation.
The second algorithm requires greater round-complexity, but produces larger virtual memory for each vertex without any fragmentation.

Submitted 20 June, 2020; v1 submitted 8 May, 2020; originally announced May 2020.

Comments: 14 pages

arXiv:2004.03374 [pdf, other]

Complete CVDL Methodology for Investigating Hydrodynamic Instabilities

Authors: Re'em Harel, Matan Rusanovsky, Yehonatan Fridman, Assaf Shimony, Gal Oren

Abstract: In fluid dynamics, one of the most important research fields is hydrodynamic instabilities and their evolution in different flow regimes. The investigation of said instabilities is concerned with highly non-linear dynamics. Currently, three main methods are used for understanding such phenomena - namely analytical models, experiments and simulations - and all of them are primarily investigated and correlated using human expertise. In this work we claim and demonstrate that a major portion of this research effort could and should be analysed using recent breakthrough advancements in the field of Computer Vision with Deep Learning (CVDL, or Deep Computer-Vision). Specifically, we target and evaluate specific state-of-the-art techniques - such as Image Retrieval, Template Matching, Parameters Regression and Spatiotemporal Prediction - for the quantitative and qualitative benefits they provide. In order to do so we focus in this research on one of the most representative instabilities, the Rayleigh-Taylor one, simulate its behaviour and create an open-sourced state-of-the-art annotated database (RayleAI). Finally, we use adjusted experimental results and novel physical loss methodologies to validate the correspondence of the predicted results to actual physical reality to prove the models' efficiency.
The techniques which were developed and proved in this work can serve as essential tools for physicists in the field of hydrodynamics for investigating a variety of physical systems, and could also be used via Transfer Learning for research on other instabilities. Some of the techniques can be easily applied to existing simulation results. All models, as well as the data-set that was created for this work, are publicly available at: https://github.com/scientific-computing-nrcn/SimulAI.

Submitted 26 April, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

Comments: 21 pages

arXiv:2003.04226 [pdf, other]

MLography: An Automated Quantitative Metallography Model for Impurities Anomaly Detection using Novel Data Mining and Deep Learning Approach

Authors: Matan Rusanovsky, Gal Oren, Sigalit Ifergane, Ofer Beeri

Abstract: The micro-structure of most engineering alloys contains some inclusions and precipitates, which may affect their properties; therefore, it is crucial to characterize them. In this work we focus on the development of a state-of-the-art artificial intelligence model for Anomaly Detection named MLography to automatically quantify the degree of anomaly of impurities in alloys. For this purpose, we introduce several anomaly detection measures: Spatial, Shape and Area anomaly, that successfully detect the most anomalous objects based on their objective, given that the impurities were already labeled. The first two measures quantify the degree of anomaly of each object by how distant and big each object is compared to its neighborhood, and by the abnormality of its own shape, respectively. The last measure combines the former two and highlights the most anomalous regions among all input images, for later (physical) examination. The performance of the model is presented and analyzed based on a few representative cases. We stress that although the models presented here were developed for metallography analysis, most of them can be generalized to a wider set of problems in which anomaly detection of geometrical objects is desired. All models, as well as the data-set that was created for this work, are publicly available at: https://github.com/matanr/MLography.

Submitted 27 February, 2020; originally announced March 2020.

Comments: 9 pages, 8 figures, 3 algorithms, 1 table

arXiv:1910.06415 [pdf, other]

BACKUS: Comprehensive High-Performance Research Software Engineering Approach for Simulations in Supercomputing Systems

Authors: Matan Rusanovsky, Re'em Harel, Lee-or Alon, Idan Mosseri, Harel Levin, Gal Oren

Abstract: High-Performance Computing (HPC) platforms enable scientific software to achieve breakthroughs in many research fields such as physics, biology, and chemistry, by employing Research Software Engineering (RSE) techniques. These include 1) novel parallelism paradigms such as Shared Memory Parallelism (with e.g. OpenMP 4.5); Distributed Memory Parallelism (with e.g. MPI 4); Hybrid Parallelism which combines them; and Heterogeneous Parallelism (for CPUs, co-processors and accelerators), 2) introducing advanced Software Engineering concepts such as Object Oriented Parallel Programming (OOPP); Parallel Unit testing; Parallel I/O Formats; Hybrid Parallel Visualization; and 3) Selecting the Best Practices in other necessary areas such as User Interface; Automatic Documentation; Version Control and Project Management. In this work we present BACKUS: Comprehensive High-Performance Research Software Engineering Approach for Simulations in Supercomputing Systems, which we found to fit best for long-lived parallel scientific codes.

Submitted 14 October, 2019; originally announced October 2019.

Comments: 19 pages, 4 figures

arXiv:1908.05700 [pdf, ps, other]

Distributed Backup Placement in One Round and its Applications to Maximum Matching Approximation and Self-Stabilization

Authors: Leonid Barenboim, Gal Oren

Abstract: In the distributed backup-placement problem each node of a network has to select one neighbor, such that the maximum number of nodes that make the same selection is minimized. This is a natural relaxation of the perfect matching problem, in which each node is selected just by one neighbor. Previous (approximate) solutions for backup placement are non-trivial, even for simple graph topologies, such as dense graphs. In this paper we devise an algorithm for dense graph topologies, including unit disk graphs, unit ball graphs, line graphs, graphs with bounded diversity, and many more. Our algorithm requires just one round, and is as simple as the following operation. Consider a circular list of neighborhood IDs, sorted in an ascending order, and select the ID that is next to the selecting vertex ID. Surprisingly, such a simple one-round strategy turns out to be very efficient for backup placement computation in dense networks. Not only does it improve the number of rounds of the solution, but the approximation ratio is also improved by a multiplicative factor of at least $2$. Our new algorithm has several interesting implications. In particular, it gives rise to a $(2 + ε)$-approximation to maximum matching within $O(\log^* n)$ rounds in dense networks. The resulting algorithm is very simple as well, in sharp contrast to previous algorithms that compute such a solution within this running time. Moreover, these algorithms are applicable to a narrower graph family than our algorithm. For the same graph family, the best previously-known result has $O(\log Δ + \log^* n)$ running time. Another interesting implication is the possibility to execute our backup placement algorithm as-is in the self-stabilizing setting. This makes it possible to simplify and improve other algorithms for the self-stabilizing setting, by employing helpful properties of backup placement.

Submitted 15 August, 2019; originally announced August 2019.

Comments: 8 pages
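
The one-round rule quoted in the abstract above can be illustrated with a short Python sketch. It is not the authors' code: the function names and the dictionary-based graph representation are hypothetical, and "next to the selecting vertex ID" is read here as "the smallest neighbor ID larger than the vertex's own ID, wrapping around to the smallest neighbor ID". The second function computes the quantity the problem tries to minimize, namely the largest number of vertices that select the same neighbor.

```python
from bisect import bisect_right


def one_round_backup(vertex_id, neighbor_ids):
    """Select the neighbor ID that follows vertex_id in the sorted circular list.

    Assumption: "next to" means the cyclic successor of vertex_id among
    the neighbor IDs sorted in ascending order.
    """
    ordered = sorted(neighbor_ids)
    if not ordered:
        raise ValueError("vertex has no neighbors")
    pos = bisect_right(ordered, vertex_id)  # index of first ID > vertex_id
    return ordered[pos % len(ordered)]      # wrap around the circular list


def max_load(graph):
    """Apply the rule at every vertex; return the selections and the maximum load.

    graph maps each vertex ID to the set of its neighbor IDs (hypothetical format).
    """
    selection = {v: one_round_backup(v, nbrs) for v, nbrs in graph.items()}
    load = {}
    for target in selection.values():
        load[target] = load.get(target, 0) + 1
    return selection, max(load.values())


# A complete graph on four vertices (a dense topology).
clique = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
selection, load = max_load(clique)
print(selection, load)
```

In the clique every vertex picks its cyclic successor, so no vertex is selected more than once. In a star graph, by contrast, every leaf has only the center to choose from, which is one reason the abstract restricts its guarantees to dense topologies.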

arXiv:1907.11565 [pdf, other]

Cooperative image captioning

Authors: Gilad Vered, Gal Oren, Yuval Atzmon, Gal Chechik

Abstract: When describing images with natural language, the descriptions can be made more informative if tuned using downstream tasks. This is often achieved by training two networks: a "speaker network" that generates sentences given an image, and a "listener network" that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate to achieve a joint task faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. We describe an approach that addresses both challenges. We first develop a new effective optimization based on partial-sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for Partial-Sampling Straight-Through. Second, we show that the generated descriptions can be kept close to natural by constraining them to be similar to human descriptions. Together, this approach creates descriptions that are both more discriminative and more natural than previous approaches. Evaluations on the standard COCO benchmark show that PSST Multinomial dramatically improves the recall@10 from 60% to 86% while maintaining comparable language naturalness, and human evaluations show that it also increases naturalness while keeping the discriminative power of generated captions.

Submitted 26 July, 2019; originally announced July 2019.

arXiv:1902.08819 [pdf, ps, other]

Fast Distributed Backup Placement in Sparse and Dense Networks

Authors: Leonid Barenboim, Gal Oren

Abstract: We consider the Backup Placement problem in networks in the $\mathcal{CONGEST}$ distributed setting. Given a network graph $G = (V,E)$, the goal of each vertex $v \in V$ is selecting a neighbor, such that the maximum number of vertices in $V$ that select the same vertex is minimized. The backup placement problem was introduced by Halldorsson, Kohler, Patt-Shamir, and Rawitz, who obtained an $O(\log n/ \log \log n)$ approximation with randomized polylogarithmic time. Their algorithm remained the state-of-the-art for general graphs, as well as specific graph topologies. In this paper we obtain significantly improved algorithms for various graph topologies. Specifically, we show that $O(1)$-approximation to optimal backup placement can be computed deterministically in $O(1)$ rounds in graphs that model wireless networks, certain social networks, claw-free graphs, and more generally, in any graph with neighborhood independence bounded by a constant. At the other end, we consider sparse graphs, such as trees, forests, planar graphs and graphs of constant arboricity, and obtain a constant approximation to optimal backup placement in $O(\log n)$ deterministic rounds. Clearly, our constant-time algorithms for graphs with constant neighborhood independence are asymptotically optimal. Moreover, we show that our algorithms for sparse graphs are not far from optimal as well, by proving several lower bounds. Specifically, optimal backup placement of unoriented trees requires $Ω(\log n)$ time, and approximate backup placement with a polylogarithmic approximation factor requires $Ω(\sqrt {\log n / \log \log n})$ time. Our results extend the knowledge regarding the question of "what can be computed locally?", and reveal surprising gaps between complexities of distributed symmetry breaking problems.

Submitted 14 August, 2019; v1 submitted 23 February, 2019; originally announced February 2019.

Comments: 19 pages

arXiv:1707.07823 [pdf]

Mathematical Model for Detection of Leakage in Domestic Water Supply Systems by Reading Consumption from an Analogue Water Meter

Authors: Gal Oren, Nerya Y. Stroh

Abstract: In this article we introduce the principles to detect leakage using a mathematical model based on machine learning and domestic water consumption monitoring in real time. The model uses data measured from a water meter, analyzes the water consumption, and uses two criteria simultaneously: deviation from the average consumption, and comparison of steady water consumptions over a period of time. Simulation of the model on a regular household consumer was implemented on Antileaks, a device that we built, designed to transfer consumption information from an analogue water meter to a digital form in real time.

Submitted 25 July, 2017; originally announced July 2017.

Journal ref: International Journal of Environmental Science and Development (IJESD), Vol. 4, No. 4, International Association of Computer Science and Information Technology Press, ISSN: 2010-0264, 2013
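
The two criteria named in the abstract above (deviation from the average consumption, and steady consumption over a period of time) can be combined in a small Python sketch. This is an illustration only, not the published model: the function name, the thresholds, the window length, and the choice to require both criteria at once are all assumptions made here.

```python
from statistics import mean


def leak_suspected(readings, baseline,
                   deviation_factor=1.5, steady_window=6, steady_tolerance=0.05):
    """Flag a possible leak from per-interval consumption readings.

    readings: consumption per metering interval, most recent last (e.g. litres).
    baseline: the household's historical average consumption per interval.
    All default thresholds are hypothetical values chosen for illustration.
    """
    if len(readings) < steady_window:
        return False  # not enough data to judge steadiness
    recent = readings[-steady_window:]
    avg = mean(recent)
    # Criterion 1: recent consumption deviates upward from the average.
    deviates = avg > deviation_factor * baseline
    # Criterion 2: consumption is non-zero and nearly constant over the window,
    # which normal on/off household usage rarely produces.
    steady = min(recent) > 0 and (max(recent) - min(recent)) <= steady_tolerance * avg
    return deviates and steady


print(leak_suspected([2.0] * 6, baseline=1.0))           # constant, elevated flow
print(leak_suspected([0, 5, 0, 4, 0, 3], baseline=1.0))  # bursty, ordinary usage
```

Requiring both criteria keeps a long shower (elevated but short-lived) and a steady low trickle within the normal average from each raising an alarm on their own; a real deployment would need thresholds fitted to the meter's resolution and the household's habits.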

arXiv:1707.07738 [pdf]

Adaptive Distributed Hierarchical Sensing Algorithm for Reduction of Wireless Sensor Network Cluster-Heads Energy Consumption

Authors: Gal Oren, Leonid Barenboim, Harel Levin

Abstract: Energy efficiency is a crucial performance metric in sensor networks, directly determining the network lifetime. Consequently, a key factor in WSNs is to improve overall energy efficiency to extend the network lifetime. Although many algorithms have been presented to optimize the energy factor, energy efficiency is still one of the major problems of WSNs, especially when there is a need to sample an area with different types of loads. Unlike other energy-efficient schemes for hierarchical sampling, our hypothesis is that it is achievable, in terms of prolonging the network lifetime, to adaptively re-modify the CHs' sensing rates (the processing and transmitting stages in particular) in some specific regions that are triggered significantly less than other regions. In order to do so we introduce the Adaptive Distributed Hierarchical Sensing (ADHS) algorithm. This algorithm employs a homogeneous sensor network in a distributed fashion and changes the sampling rates of the CHs based on the variance of the sampled data without significantly damaging the accuracy of the sensed area.

Submitted 24 July, 2017; originally announced July 2017.

Comments: The 13th International Wireless Communications and Mobile Computing Conference

Journal ref: The 13th International Wireless Communications and Mobile Computing Conference, 2017, 980-986
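
The core idea stated in the ADHS abstract above, changing a cluster-head's sampling rate according to the variance of the data it sampled, can be sketched as follows. This is not the ADHS algorithm itself: the function name, the variance thresholds, the doubling and halving steps, and the interval bounds are hypothetical values chosen only to make the idea concrete.

```python
from statistics import pvariance


def next_sampling_interval(samples, interval,
                           low_var=0.01, high_var=0.1,
                           min_interval=1.0, max_interval=60.0):
    """Return the next sampling interval (seconds) for one cluster-head.

    samples: the values sensed during the last reporting window.
    interval: the current time between samples.
    Thresholds and bounds are illustrative assumptions.
    """
    if len(samples) < 2:
        return interval  # variance is undefined for fewer than two samples
    var = pvariance(samples)
    if var < low_var:
        # Quiet region: sample less often to save energy.
        return min(interval * 2, max_interval)
    if var > high_var:
        # Active region: sample more often to preserve accuracy.
        return max(interval / 2, min_interval)
    return interval


print(next_sampling_interval([20.0, 20.0, 20.0], 4.0))  # quiet window
print(next_sampling_interval([10.0, 30.0], 4.0))        # active window
```

Bounding the interval on both sides is a design choice of this sketch: the upper bound limits how long an event in a quiet region can go unnoticed, and the lower bound caps the energy a busy region can consume.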

arXiv:1707.07161 [pdf]

Optimizations of Management Algorithms for Multi-Level Memory Hierarchy

Authors: Gal Oren

Abstract: In the near future the SCM is predicted to modify the form of new programs, the access form to storage, and the way that storage devices themselves are built. Therefore, a combination between the SCM and a designated Memory Allocation Manager (MAM) that will allow the programmer to manually control the different memories in the memory hierarchy will be likely to achieve a new level of performance for memory-aware data structures. Although the manual MAM seems to be the optimal approach for multi-level memory hierarchy management, this technique is still very far from being realistic, and the chances that it would be implemented in current codes using High Performance Computing (HPC) platforms are quite low. This premise means that the most reasonable way to introduce the SCM into any usable and popular memory system would be by implementing an automated version of the MAM using the fundamentals of paging algorithms, as used for two-level memory hierarchy. Our hypothesis is that achieving appropriate transferability between memory levels may be possible using ideas of algorithms employed in current virtual memory systems, and that the adaptation of those algorithms from a two-level memory hierarchy to an N-level memory hierarchy is possible. In order to reach the conclusion that our hypothesis is correct, we investigated various paging algorithms, and found the ones that could be adapted successfully from a two-level memory hierarchy to an N-level memory hierarchy. We discovered that using an adaptation of the Aging paging algorithm to an N-level memory hierarchy results in the best performance in terms of Hit/Miss ratio. In order to verify our hypothesis we built a simulator called "DeMemory simulator" for analyzing our algorithms as well as other algorithms that will be devised in the future.

Submitted 22 July, 2017; originally announced July 2017.

Comments: Master's Thesis, Diss. The Open University (2015)
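
For readers unfamiliar with the Aging paging algorithm mentioned in the abstract above, the following Python sketch shows the classic two-level version with hit/miss counting. It does not reproduce the thesis's N-level adaptation or the DeMemory simulator; the class name, the 8-bit counter width, and the tie-breaking rule on eviction are assumptions of this sketch.

```python
class AgingPager:
    """Classic (two-level) Aging page replacement with hit/miss counters."""

    def __init__(self, capacity, counter_bits=8):
        self.capacity = capacity
        self.bits = counter_bits
        self.counters = {}       # resident page -> age counter
        self.referenced = set()  # pages touched since the last clock tick
        self.hits = 0
        self.misses = 0

    def access(self, page):
        if page in self.counters:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.counters) >= self.capacity:
                # Evict the page with the smallest counter, preferring pages
                # not referenced since the last tick (assumed tie-break).
                victim = min(self.counters,
                             key=lambda p: (p in self.referenced, self.counters[p]))
                del self.counters[victim]
                self.referenced.discard(victim)
            self.counters[page] = 0
        self.referenced.add(page)

    def tick(self):
        """Clock interrupt: shift every counter right and record the reference bit."""
        top = 1 << (self.bits - 1)
        for page in self.counters:
            self.counters[page] >>= 1
            if page in self.referenced:
                self.counters[page] |= top
        self.referenced.clear()


pager = AgingPager(capacity=2)
pager.access("a"); pager.access("b"); pager.tick()
pager.access("a"); pager.tick()
pager.access("c")  # "b" was idle during the last tick, so it is evicted
print(sorted(pager.counters), pager.hits, pager.misses)
```

Because recent references land in the high-order bits, a page's counter approximates how recently and how often it was used, which is what makes the scheme a cheap stand-in for LRU.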

arXiv:1707.07137 [pdf]

AutOMP: An Automatic OpenMP Parallelization Generator for Variable-Oriented High-Performance Scientific Codes

Authors: Gal Oren, Yehuda Ganan, Guy Malamud

Abstract: OpenMP is a cross-platform API that extends C, C++ and Fortran and provides a shared-memory parallelism platform for those languages. The use of many cores and HPC technologies for scientific computing has spread since the 1990s, and now takes part in many fields of research. The relative ease of implementing OpenMP, along with the development of multi-core shared memory processors (such as Intel Xeon Phi), makes OpenMP a favorable method for parallelization in the process of modernizing legacy codes. Legacy scientific codes usually hold a large number of physical arrays which are used and updated by the code routines. In most cases the parallelization of such code focuses on loop parallelization. A key step in this parallelization is deciding which of the variables in the parallelized scope should be private (so each thread will hold a copy of them), and which variables should be shared across the threads. Another important step is finding which variables should be synchronized after the loop execution. In this work we present an automatic pre-processor that performs these stages - AutOMP (Automatic OpenMP). AutOMP recognizes all the variable assignments inside a loop. These variables will be private unless the assignment is of an array element which depends on the loop index variable. Afterwards, AutOMP finds the places where thread synchronization is needed, and which reduction operator is to be used. Finally, the program provides the parallelization command to be used for parallelizing the loop.

Submitted 22 July, 2017; originally announced July 2017.

Comments: The 7th International Supercomputing Conference in Mexico 2017
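
The private/shared rule stated in the AutOMP abstract above can be made concrete with a toy Python classifier. This is a literal reading of the abstract, not AutOMP's implementation: the function name and the input format (a list of assignment left-hand sides written as plain strings) are hypothetical, and reduction detection and synchronization analysis are left out entirely.

```python
import re


def classify_assignments(loop_index, targets):
    """Split assigned variables into (private, shared) sets.

    loop_index: name of the loop index variable, e.g. "i".
    targets: left-hand sides of assignments found in the loop body,
             e.g. "tmp", "a(i)" (Fortran style) or "c[i+1]" (C style).
    Rule (from the abstract): an assigned variable is private unless the
    assignment is to an array element whose subscript uses the loop index.
    """
    index_pattern = re.compile(r"\b" + re.escape(loop_index) + r"\b")
    target_pattern = re.compile(r"\s*(\w+)\s*(?:[(\[](.*)[)\]])?\s*")
    private, shared = set(), set()
    for target in targets:
        match = target_pattern.fullmatch(target)
        if match is None:
            continue  # skip anything this toy parser does not understand
        name, subscript = match.group(1), match.group(2)
        if subscript is not None and index_pattern.search(subscript):
            shared.add(name)   # each iteration writes its own element
        else:
            private.add(name)  # scalar or index-independent write
    return private, shared


private, shared = classify_assignments("i", ["tmp", "a(i)", "b(k)", "c[i+1]"])
print(sorted(private), sorted(shared))
```

Under this reading, the result would feed the `private(...)` and `shared(...)` clauses of an OpenMP `parallel do` / `parallel for` directive; a production tool would also have to handle variables that appear in both categories and those needing a `reduction` clause.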

Showing 1–37 of 37 results for author: Oren, G