Search | arXiv e-print repository

arXiv:2312.17506 [pdf, other]

A graph neural network-based model with Out-of-Distribution Robustness for enhancing Antiretroviral Therapy Outcome Prediction for HIV-1

Authors: Giulia Di Teodoro, Federico Siciliano, Valerio Guarrasi, Anne-Mieke Vandamme, Valeria Ghisetti, Anders Sönnerborg, Maurizio Zazzi, Fabrizio Silvestri, Laura Palagi

Abstract: Predicting the outcome of antiretroviral therapies for HIV-1 is a pressing clinical challenge, especially when the treatment regimen includes drugs for which limited effectiveness data is available. This scarcity of data can arise either due to the introduction of a new drug to the market or due to limited use in clinical settings. To tackle this issue, we introduce a novel joint fusion model, which combines features from a Fully Connected (FC) Neural Network and a Graph Neural Network (GNN). The FC network employs tabular data with a feature vector made up of viral mutations identified in the most recent genotypic resistance test, along with the drugs used in therapy. Conversely, the GNN leverages knowledge derived from Stanford drug-resistance mutation tables, which serve as benchmark references for deducing in-vivo treatment efficacy based on the viral genetic sequence, to build informative graphs. We evaluated these models' robustness against Out-of-Distribution drugs in the test set, with a specific focus on the GNN's role in handling such scenarios. Our comprehensive analysis demonstrates that the proposed model consistently outperforms the FC model, especially when considering Out-of-Distribution drugs. These results underscore the advantage of integrating Stanford scores in the model, thereby enhancing its generalizability and robustness and extending its utility in real-world applications with limited data availability. This research highlights the potential of our approach to inform antiretroviral therapy outcome prediction and contribute to more informed clinical decisions.

Submitted 29 December, 2023; originally announced December 2023.

Comments: 32 pages, 2 figures

MSC Class: 68 ACM Class: I.2.6

arXiv:2311.04846 [pdf, other]

doi 10.1093/bioinformatics/btae327

Incorporating temporal dynamics of mutations to enhance the prediction capability of antiretroviral therapy's outcome for HIV-1

Authors: Giulia Di Teodoro, Martin Pirkl, Francesca Incardona, Ilaria Vicenti, Anders Sönnerborg, Rolf Kaiser, Laura Palagi, Maurizio Zazzi, Thomas Lengauer

Abstract: Motivation: In predicting HIV therapy outcomes, a critical clinical question is whether using historical information can enhance predictive capabilities compared with current or latest available data analysis. This study analyses whether historical knowledge, which includes viral mutations detected in all genotypic tests before therapy, their temporal occurrence, and concomitant viral load measurements, can bring improvements. We introduce a method to weigh mutations, considering the previously enumerated factors and the reference mutation-drug Stanford resistance tables. We compare a model encompassing history (H) with one not using it (NH). Results: The H-model demonstrates superior discriminative ability, with a higher ROC-AUC score (76.34%) than the NH-model (74.98%). Significant Wilcoxon test results confirm that incorporating historical information consistently improves predictive accuracy for treatment outcomes. The better performance of the H-model might be attributed to its consideration of latent HIV reservoirs, probably obtained when leveraging historical information. The findings emphasize the importance of temporal dynamics in mutations, offering insights into HIV infection complexities. However, our result also shows that prediction accuracy remains relatively high even when no historical information is available. Supplementary information: Supplementary material is available.

Submitted 24 June, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 16 pages, 6 figures

Journal ref: Bioinformatics, Volume 40, Issue 6, June 2024, btae327

arXiv:2302.07580 [pdf, other]

doi 10.1016/j.ejco.2024.100084

Unboxing Tree Ensembles for interpretability: a hierarchical visualization tool and a multivariate optimal re-built tree

Authors: Giulia Di Teodoro, Marta Monaci, Laura Palagi

Abstract: The interpretability of models has become a crucial issue in Machine Learning because of algorithmic decisions' growing impact on real-world applications. Tree ensemble methods, such as Random Forests or XGBoost, are powerful learning tools for classification tasks. However, while combining multiple trees may provide higher prediction quality than a single one, it sacrifices the interpretability property, resulting in "black-box" models. In light of this, we aim to develop an interpretable representation of a tree-ensemble model that can provide valuable insights into its behavior. First, given a target tree-ensemble model, we develop a hierarchical visualization tool based on a heatmap representation of the forest's feature use, considering the frequency of a feature and the level at which it is selected as an indicator of importance. Next, we propose a mixed-integer linear programming (MILP) formulation for constructing a single optimal multivariate tree that accurately mimics the target model predictions. The goal is to provide an interpretable surrogate model based on oblique hyperplane splits, which uses only the most relevant features according to the defined forest's importance indicators. The MILP model includes a penalty on feature selection based on their frequency in the forest to further induce sparsity of the splits. The natural formulation has been strengthened to improve the computational performance of mixed-integer software. Computational experiments are carried out on benchmark datasets from the UCI repository using a state-of-the-art off-the-shelf solver. Results show that the proposed model is effective in yielding a shallow interpretable tree approximating the tree-ensemble decision function.

Submitted 18 January, 2024; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: 44 pages, 9 figures, 20 tables

arXiv:2212.04551 [pdf, other]

doi 10.1109/SBAC-PAD55451.2022.00022

Efficient Strategies for Graph Pattern Mining Algorithms on GPUs

Authors: Samuel Ferraz, Vinicius Dias, Carlos H. C. Teixeira, George Teodoro, Wagner Meira Jr

Abstract: Graph Pattern Mining (GPM) is an important, rapidly evolving, and computation demanding area. GPM computation relies on subgraph enumeration, which consists in extracting subgraphs that match a given property from an input graph. Graphics Processing Units (GPUs) have been an effective platform to accelerate applications in many areas. However, the irregularity of subgraph enumeration makes it challenging for efficient execution on GPU due to typical uncoalesced memory access, divergence, and load imbalance. Unfortunately, these aspects have not been fully addressed in previous work. Thus, this work proposes novel strategies to design and implement subgraph enumeration efficiently on GPU. We support a depth-first search style search (DFS-wide) that maximizes memory performance while providing enough parallelism to be exploited by the GPU, along with a warp-centric design that minimizes execution divergence and improves utilization of the computing capabilities. We also propose a low-cost load balancing layer to avoid idleness and redistribute work among thread warps in a GPU. Our strategies have been deployed in a system named DuMato, which provides a simple programming interface to allow efficient implementation of GPM algorithms. Our evaluation has shown that DuMato is often an order of magnitude faster than state-of-the-art GPM systems and can mine larger subgraphs (up to 12 vertices).

Submitted 8 December, 2022; originally announced December 2022.

Comments: Accepted for publication on IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'22)

arXiv:2206.06182 [pdf, other]

AI-based Data Preparation and Data Analytics in Healthcare: The Case of Diabetes

Authors: Marianna Maranghi, Aris Anagnostopoulos, Irene Cannistraci, Ioannis Chatzigiannakis, Federico Croce, Giulia Di Teodoro, Michele Gentile, Giorgio Grani, Maurizio Lenzerini, Stefano Leonardi, Andrea Mastropietro, Laura Palagi, Massimiliano Pappa, Riccardo Rosati, Riccardo Valentini, Paola Velardi

Abstract: The Associazione Medici Diabetologi (AMD) collects and manages one of the largest worldwide-available collections of diabetic patient records, also known as the AMD database. This paper presents the initial results of an ongoing project whose focus is the application of Artificial Intelligence and Machine Learning techniques for conceptualizing, cleaning, and analyzing such an important and valuable dataset, with the goal of providing predictive insights to better support diabetologists in their diagnostic and therapeutic choices.

Submitted 20 July, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: The work has been presented at the conference Ital-IA 2022 (https://www.ital-ia2022.it/)

arXiv:1910.14548 [pdf, other]

Run-time Parameter Sensitivity Analysis Optimizations

Authors: Eduardo Scartezini, Willian Barreiros Jr., Tahsin Kurc, Jun Kong, Alba C. M. A. Melo, Joel Saltz, George Teodoro

Abstract: Efficient execution of parameter sensitivity analysis (SA) is critical to allow for its routine use. The pathology image processing application investigated in this work processes high-resolution whole-slide cancer tissue images from large datasets to characterize and classify the disease. However, the application is parameterized and changes in parameter values may significantly affect its results. Thus, understanding the impact of parameters on the output using SA is important to draw reliable scientific conclusions. The execution of the application is rather compute intensive, and an SA requires it to process the input data multiple times as parameter values are systematically varied. Optimizing this process is therefore important to allow for SA to be executed with large datasets. In this work, we employ a distributed computing system with novel computation reuse optimizations to accelerate SA. The new computation reuse strategy can maximize reuse even with limited memory availability, where previous approaches would not be able to fully take advantage of reuse. The proposed solution was evaluated on an environment with 256 nodes (7168 CPU-cores), attaining a parallel efficiency of over 92% and improving on the previous reuse strategies by up to 2.8x.

Submitted 31 October, 2019; originally announced October 2019.

Comments: 8 pages, 8 figures
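
The computation reuse idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' strategy or system: the two-stage pipeline, the function names (`normalize`, `segment`, `run_study`), and the parameters (`blur`, `threshold`) are hypothetical, chosen only to show how a sensitivity study can avoid re-running a stage whose parameters have not changed.

```python
from itertools import product

# Hypothetical call counters, used only to make the saved work visible.
calls = {"normalize": 0, "segment": 0}


def normalize(image, blur):
    """Toy first stage: depends only on the `blur` parameter."""
    calls["normalize"] += 1
    return [round(p * blur, 6) for p in image]


def segment(normalized, threshold):
    """Toy second stage: depends on the first stage's output and `threshold`."""
    calls["segment"] += 1
    return [1 if p >= threshold else 0 for p in normalized]


def run_study(image, blurs, thresholds):
    """Evaluate every (blur, threshold) pair, reusing first-stage results."""
    cache = {}    # blur value -> output of normalize
    results = {}  # (blur, threshold) -> segmentation mask
    for blur, threshold in product(blurs, thresholds):
        if blur not in cache:
            # Computed once per distinct blur value, not once per pair.
            cache[blur] = normalize(image, blur)
        results[(blur, threshold)] = segment(cache[blur], threshold)
    return results
```

With two `blur` values and three `threshold` values, the first stage runs twice instead of six times. An unbounded cache like this one ignores the limited-memory setting that the abstract says the paper addresses.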

arXiv:1910.13082 [pdf]

Utilization of Pulse Rate Variability between Post-sleep and Wake Cycles to increase Alarm Clock Efficiency using Arduino-based Non-invasive Pulse Detector

Authors: Daniel D. Cabrales, Jhon Rafael M. Cuartero, Lejan Alfred C. Enriquez, John Gabriel Z. Erne, Bryan Dayton Edward N. Galvadores, Mark Lester S. Millan, Beryl Keziah C. Monterona, John Vincent P. Panergalin, Melbert Neil G. Teodoro, Lean Karlo S. Tolentino

Abstract: The most common reason for students' tardiness is waking up late every morning. The causes may vary from lack of rest and sleep, to failure of ascertaining an alarm clock's noise or, sometimes, falling back to sleep again after snoozing an alarm. This paper's purpose is to address the problem concerning students' inability to continue to stay awake after turning off their alarms using an Arduino-based pulse rate triggered alarm interrupt (APRTAI). The proposed system will enable students to make use of a higher pulse rate from their thumb to turn off an alarm, which can be achieved by engaging themselves in intensive physical activities that will eventually eradicate drowsiness and therefore prevent them from falling back to sleep again.

Submitted 29 October, 2019; originally announced October 2019.

Comments: 8 pages

Journal ref: International Higher Education Research Forum (2016) 1-7

arXiv:1811.11653 [pdf, other]

Accelerating Sensitivity Analysis in Microscopy Image Segmentation Workflows

Authors: Willian de Oliveira Barreiros Junior, George Teodoro

Abstract: With the increasing availability of digital microscopy imaging equipment there is a demand for efficient execution of whole slide tissue image applications. Through the process of sensitivity analysis it is possible to improve the output quality of such applications, and thus, improve the desired analysis quality. Due to the high computational cost of such analyses and the recurrent nature of executed tasks from sensitivity analysis methods (i.e., reexecution of tasks), the opportunity for computation reuse arises. By performing computation reuse we can optimize the run time of sensitivity analysis applications. This work therefore focuses on finding new ways to take advantage of computation reuse opportunities on multiple task abstraction levels. This is done by presenting the coarse-grain merging strategy and the new fine-grain merging algorithms, implemented on top of the Region Templates Framework.

Submitted 28 November, 2018; originally announced November 2018.

Comments: 44 pages

arXiv:1810.02911 [pdf]

Tuning for Tissue Image Segmentation Workflows for Accuracy and Performance

Authors: Luis F. R. Taveira, Tahsin Kurc, Alba C. M. A. Melo, Jun Kong, Erich Bremer, Joel H. Saltz, George Teodoro

Abstract: We propose a software platform that integrates methods and tools for multi-objective parameter auto-tuning in tissue image segmentation workflows. The goal of our work is to provide an approach for improving the accuracy of nucleus/cell segmentation pipelines by tuning their input parameters. The shape, size and texture features of nuclei in tissue are important biomarkers for disease prognosis, and accurate computation of these features depends on accurate delineation of boundaries of nuclei. Input parameters in many nucleus segmentation workflows affect segmentation accuracy and have to be tuned for optimal performance. This is a time-consuming and computationally expensive process; automating this step facilitates more robust image segmentation workflows and enables more efficient application of image analysis in large image datasets. Our software platform adjusts the parameters of a nuclear segmentation algorithm to maximize the quality of image segmentation results while minimizing the execution time. It implements several optimization methods to search the parameter space efficiently. In addition, the methodology is developed to execute on high performance computing systems to reduce the execution time of the parameter tuning phase. Our results using three real-world image segmentation workflows demonstrate that the proposed solution is able to (1) search a small fraction (about 100 points) of the parameter space, which contains billions to trillions of points, and improve the quality of segmentation output by 1.20x, 1.29x, and 1.29x, on average; (2) decrease the execution time of a segmentation workflow by up to 11.79x while improving output quality; and (3) effectively use parallel systems to accelerate parameter tuning and segmentation phases.

Submitted 5 October, 2018; originally announced October 2018.

Comments: 29 pages, 5 figures

arXiv:1808.04795 [pdf, other]

Clumped Nuclei Segmentation with Adjacent Point Match and Local Shape based Intensity Analysis for Overlapped Nuclei in Fluorescence In-Situ Hybridization Images

Authors: Xiaoyuan Guo, Hanyi Yu, Blair Rossetti, George Teodoro, Daniel Brat, Jun Kong

Abstract: Highly clumped nuclei clusters captured in fluorescence in situ hybridization microscopy images are common histology entities under investigation in a wide spectrum of tissue-related biomedical investigations. Due to their large scale in presence, computer based image analysis is used to facilitate such analysis with improved analysis efficiency and reproducibility. To ensure the quality of downstream biomedical analyses, it is essential to segment clustered nuclei with high quality. However, this presents a technical challenge commonly encountered in a large number of biomedical studies, as nuclei are often overlapped due to a high cell density. In this paper, we propose a segmentation algorithm that identifies point pair connection candidates and evaluates adjacent point connections with a formulated ellipse fitting quality indicator. After connection relationships are determined, we recover the resulting dividing paths by following points with specific eigenvalues from Hessian in a constrained searching space. We validate our algorithm with 560 image patches from two classes of tumor regions of seven brain tumor patients. Both qualitative and quantitative experimental results suggest that our algorithm is promising for dividing overlapped nuclei in fluorescence in situ hybridization microscopy images widely used in various biomedical research.

Submitted 14 August, 2018; originally announced August 2018.

Comments: 4 pages

arXiv:1806.09093 [pdf, other]

Analysis of Cellular Feature Differences of Astrocytomas with Distinct Mutational Profiles Using Digitized Histopathology Images

Authors: Mousumi Roy, Fusheng Wang, George Teodoro, Jose Velazqeuz Vega, Daniel Brat, Jun Kong

Abstract: Cellular phenotypic features derived from histopathology images are the basis of pathologic diagnosis and are thought to be related to underlying molecular profiles. Due to overwhelming cell numbers and population heterogeneity, it remains challenging to quantitatively compute and compare features of cells with distinct molecular signatures. In this study, we propose a self-reliant and efficient analysis framework that supports quantitative analysis of cellular phenotypic difference across distinct molecular groups. To demonstrate efficacy, we quantitatively analyze astrocytomas that are molecularly characterized as either Isocitrate Dehydrogenase (IDH) mutant (MUT) or wildtype (WT) using imaging data from The Cancer Genome Atlas database. Representative cell instances that are phenotypically different between these two groups are retrieved after segmentation, feature computation, data pruning, dimensionality reduction, and unsupervised clustering. Our analysis is generic and can be applied to a wide set of cell-based biomedical research.

Submitted 24 June, 2018; originally announced June 2018.

arXiv:1806.09090 [pdf, other]

Segmentation of Overlapped Steatosis in Whole-Slide Liver Histopathology Microscopy Images

Authors: Mousumi Roy, Fusheng Wang, George Teodoro, Miriam B Vos, Alton Brad Farris, Jun Kong

Abstract: An accurate steatosis quantification with pathology tissue samples is of high clinical importance. However, such pathology measurement is manually made in most clinical practices, subject to severe reader variability due to large sampling bias and poor reproducibility. Although some computerized automated methods are developed to quantify the steatosis regions, they present limited analysis capacity for high resolution whole-slide microscopy images and accurate overlapped steatosis division. In this paper, we propose a method that extracts an individual whole tissue piece at high resolution with minimum background area by estimating tissue bounding box and rotation angle. This is followed by the segmentation and segregation of steatosis regions with high curvature point detection and an ellipse fitting quality assessment method. We validate our method with isolated and overlapped steatosis regions in liver tissue images of 11 patients. The experimental results suggest that our method is promising for enhanced support of steatosis quantization during the pathology review for liver disease treatment.

Submitted 24 June, 2018; originally announced June 2018.

arXiv:1711.07295 [pdf, other]

Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations

Authors: Edans F. O. Sandes, George Teodoro, Alba C. M. A. Melo

Abstract: The Exact Set Similarity Join problem aims to find all similar sets between two collections of sets, with respect to a threshold and a similarity function such as overlap, Jaccard, dice or cosine. The naive approach verifies all pairs of sets and it is often considered impractical due to the high number of combinations. So, Exact Set Similarity Join algorithms are usually based on the Filter-Verification Framework, which applies a series of filters to reduce the number of verified pairs. This paper presents a new filtering technique called Bitmap Filter, which is able to accelerate state-of-the-art algorithms for the exact Set Similarity Join problem. The Bitmap Filter uses hash functions to create bitmaps of fixed b bits, representing characteristics of the sets. Then, it applies bitwise operations (such as xor and population count) on the bitmaps in order to infer a similarity upper bound for each pair of sets. If the upper bound is below a given similarity threshold, the pair of sets is pruned. The Bitmap Filter benefits from the fact that bitwise operations are efficiently implemented by many modern general-purpose processors and it was easily applied to four state-of-the-art algorithms implemented in CPU: AllPairs, PPJoin, AdaptJoin and GroupJoin. Furthermore, we propose a Graphic Processor Unit (GPU) algorithm based on the naive approach but using the Bitmap Filter to speed up the computation. The experiments considered 9 collections containing from 100 thousand up to 10 million sets and the joins were made using Jaccard thresholds from 0.50 to 0.95. The Bitmap Filter was able to improve 90% of the experiments in CPU, with speedups of up to 4.50x and 1.43x on average. Using the GPU algorithm, the experiments were able to speed up the original CPU algorithms by up to 577x using an Nvidia Geforce GTX 980 Ti.

Submitted 20 November, 2017; originally announced November 2017.

Comments: 13 pages, 14 figures
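
The filtering step described in the abstract above (hash each set into a fixed-width bitmap, then use xor and population count to bound the similarity) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the bitmap width `B = 64`, the CRC32-based hash, and all function names are hypothetical. The bound relies on one fact: two bitmaps can differ in a bit only because of a token in the symmetric difference of the sets, so the Hamming distance is at most |r| + |s| - 2|r ∩ s|.

```python
import zlib

B = 64  # bitmap width in bits (assumed value, not taken from the paper)


def bitmap(tokens, b=B):
    """Xor-style bitmap: each distinct token flips one hashed bit position."""
    bits = 0
    for t in set(tokens):
        # CRC32 is an arbitrary deterministic hash chosen for this sketch.
        bits ^= 1 << (zlib.crc32(str(t).encode()) % b)
    return bits


def overlap_upper_bound(r, s, b=B):
    """Upper bound on |r & s| derived from the Hamming distance of the bitmaps."""
    r, s = set(r), set(s)
    hamming = bin(bitmap(r, b) ^ bitmap(s, b)).count("1")  # population count
    return (len(r) + len(s) - hamming) // 2


def jaccard_upper_bound(r, s, b=B):
    """Jaccard grows with the overlap, so an overlap bound gives a Jaccard bound."""
    r, s = set(r), set(s)
    o = overlap_upper_bound(r, s, b)
    union_lower_bound = len(r) + len(s) - o
    return o / union_lower_bound if union_lower_bound else 1.0


def may_be_similar(r, s, threshold, b=B):
    """False means the pair can be pruned without exact verification."""
    return jaccard_upper_bound(r, s, b) >= threshold
```

A pair that passes the filter still needs exact verification, because hash collisions can only loosen the bound, never tighten it below the true similarity. In a real join the bitmaps would be computed once per set and stored, rather than recomputed for every pair as in this sketch.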

arXiv:1612.03413 [pdf, other]

Efficient Methods and Parallel Execution for Algorithm Sensitivity Analysis with Parameter Tuning on Microscopy Imaging Datasets

Authors: George Teodoro, Tahsin Kurc, Luis F. R. Taveira, Alba C. M. A. Melo, Jun Kong, Joel Saltz

Abstract: Background: We describe an informatics framework for researchers and clinical investigators to efficiently perform parameter sensitivity analysis and auto-tuning for algorithms that segment and classify image features in a large dataset of high-resolution images. The computational cost of the sensitivity analysis process can be very high, because the process requires processing the input dataset several times to systematically evaluate how output varies when input parameters are varied. Thus, high performance computing techniques are required to quickly execute the sensitivity analysis process. Results: We carried out an empirical evaluation of the proposed method on high performance computing clusters with multi-core CPUs and co-processors (GPUs and Intel Xeon Phis). Our results show that (1) the framework achieves excellent scalability and efficiency on a high performance computing cluster -- execution efficiency remained above 85% in all experiments; (2) the parameter auto-tuning methods are able to converge by visiting only a small fraction (0.0009%) of the search space with limited impact to the algorithm output (0.56% on average). Conclusions: The sensitivity analysis framework provides a range of strategies for the efficient exploration of the parameter space, as well as multiple indexes to evaluate the effect of parameter modification to outputs or even correlation between parameters. Our work demonstrates the feasibility of performing sensitivity analyses, parameter studies, and auto-tuning with large datasets with the use of high-performance systems and techniques. The proposed technologies will enable the quantification of error estimations and output variations in these pipelines, which may be used in application specific ways to assess uncertainty of conclusions extracted from data generated by these image analysis pipelines.

Submitted 11 December, 2016; originally announced December 2016.

Comments: 36 pages, 10 figures

arXiv:1605.00930 [pdf, other]

Efficient Execution of Irregular Wavefront Propagation Pattern on Many Integrated Core Architecture

Authors: Jeremias Gomes, George Teodoro

Abstract: The efficient execution of image processing algorithms is an active area of Bioinformatics. In image processing, one of the classes of algorithms or computing patterns that works with irregular data structures is the Irregular Wavefront Propagation Pattern (IWPP). In this class, elements propagate information to neighbors in the form of wave propagation. This propagation results in irregular access to data and expansions. Due to this irregularity, current implementations of this class of algorithms require atomic operations, which are very costly and also restrain implementations with Single Instruction, Multiple Data (SIMD) instructions in Many Integrated Core (MIC) architectures, which are critical to attain high performance on this processor. The objective of this study is to redesign the Irregular Wavefront Propagation Pattern algorithm in order to enable the efficient execution on processors with Many Integrated Core architecture using SIMD instructions. In this work, using the Intel (R) Xeon Phi (TM) coprocessor, we have implemented a vector version of IWPP with up to 5.63x gains on the non-vectored version, a parallel version using a First In, First Out (FIFO) queue that attained a speedup of up to 55x as compared to the single core version on the coprocessor, a version using a priority queue whose performance was 1.62x better than the fastest GPU based implementation available in the literature, and a cooperative version between heterogeneous processors that allows processing images bigger than the Intel (R) Xeon Phi (TM) memory and also provides a way to utilize all the available devices in the computation.

Submitted 3 May, 2016; originally announced May 2016.

Comments: in Portuguese

arXiv:1505.03819 [pdf, other]

Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs

Authors: George Teodoro, Tahsin Kurc, Guilherme Andrade, Jun Kong, Renato Ferreira, Joel Saltz

Abstract: We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core - MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the performance of operations with respect to the device used. The performances of operations with regular data access are comparable or sometimes better on a MIC than that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performance compared to classic strategies in hybrid configurations.

Submitted 14 May, 2015; originally announced May 2015.

Comments: 22 pages, 12 figures, 6 tables

arXiv:1405.7958 [pdf, other]

Region Templates: Data Representation and Management for Large-Scale Image Analysis

Authors: George Teodoro, Tony Pan, Tahsin Kurc, Jun Kong, Lee Cooper, Scott Klasky, Joel Saltz

Abstract: Distributed memory machines equipped with CPUs and GPUs (hybrid computing nodes) are hard to program because of the multiple layers of memory and heterogeneous computing configurations. In this paper, we introduce a region template abstraction for the efficient management of common data types used in analysis of large datasets of high resolution images on clusters of hybrid computing nodes. The region template provides a generic container template for common data structures, such as points, arrays, regions, and object sets, within a spatial and temporal bounding box. The region template abstraction enables different data management strategies and data I/O implementations, while providing a homogeneous, unified interface to the application for data storage and retrieval. The execution of region templates applications is coordinated by a runtime system that supports efficient execution in hybrid machines. Region templates applications are represented as hierarchical dataflow in which each computing stage may be represented as another dataflow of finer-grain tasks. A number of optimizations for hybrid machines are available in our runtime system, including performance-aware scheduling for maximizing utilization of computing devices and techniques to reduce impact of data transfers between CPUs and GPUs. An experimental evaluation on a state-of-the-art hybrid cluster using a microscopy imaging study shows that this abstraction adds negligible overhead (about 3%) and achieves good scalability.

Submitted 30 May, 2014; originally announced May 2014.

Comments: 43 pages, 17 figures

arXiv:1311.0378 [pdf, other]

Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU

Authors: George Teodoro, Tahsin Kurc, Jun Kong, Lee Cooper, Joel Saltz

Abstract: We investigate and characterize the performance of an important class of operations on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy image scanners. We identify the data access and computation patterns of operations in object segmentation and feature computation categories. We systematically implement and evaluate the performance of these core operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that (1) the data access pattern and parallelization strategy employed by the operations strongly affect their performance, while the performance on a MIC of operations that perform regular data access is comparable or sometimes better than that on a GPU; (2) GPUs are significantly more efficient than MICs for operations and algorithms that irregularly access data, as a result of the low performance of the latter when it comes to random data access; (3) adequate coordinated execution on MICs and CPUs using a performance aware task scheduling strategy improves about 1.29x over a first-come-first-served strategy. The example application attained an efficiency of 84% in an execution with 192 nodes (3072 CPU cores and 192 MICs).

Submitted 2 November, 2013; originally announced November 2013.

Comments: 11 pages, 2 figures

ACM Class: C.4; D.1.3; D.2.6

arXiv:1310.4136 [pdf, other]

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Authors: Thiago S. F. X. Teixeira, George Teodoro, Eduardo Valle, Joel H. Saltz

Abstract: Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data, while keeping low response times. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods are sequential and target shared-memory machines. Here we address these issues with a distributed, efficient, and scalable index based on Locality-Sensitive Hashing (LSH). LSH is one of the most efficient and popular techniques for similarity search, but its poor referential locality properties have made its implementation a challenging problem. Our solution is based on a widely asynchronous dataflow parallelization with a number of optimizations that include a hierarchical parallelization to decouple indexing and data storage, locality-aware data partition strategies to reduce message passing, and multi-probing to limit memory usage. The proposed parallelization attained an efficiency of 90% in a distributed system with about 800 CPU cores. In particular, the original locality-aware data partition reduced the number of messages exchanged by 30%. Our parallel LSH was evaluated using the largest public dataset for similarity search (to the best of our knowledge) with $10^9$ 128-d SIFT descriptors extracted from Web images. This is two orders of magnitude larger than datasets that previous LSH parallelizations could handle.

Submitted 15 October, 2013; originally announced October 2013.
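
The bucketing idea behind the LSH index named in the abstract above can be illustrated with a minimal, single-machine sketch. It assumes a Gaussian random-projection hash family for Euclidean distance; the class name `LSHTable` and the parameters `num_hashes`, `width`, and `seed` are hypothetical and chosen only for illustration. It does not reproduce the paper's distributed dataflow design, its data partitioning, or its multi-probing.

```python
import math
import random
from collections import defaultdict


class LSHTable:
    """One hash table of a Euclidean LSH index (illustrative sketch only)."""

    def __init__(self, dim, num_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random Gaussian direction and one random offset per hash function.
        self.projections = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)
        ]
        self.offsets = [rng.uniform(0.0, width) for _ in range(num_hashes)]
        self.buckets = defaultdict(list)

    def key(self, vector):
        # h(v) = floor((a . v + b) / w) for each hash; the tuple is the bucket key.
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, vector)) + b) / self.width)
            for proj, b in zip(self.projections, self.offsets)
        )

    def insert(self, item_id, vector):
        self.buckets[self.key(vector)].append(item_id)

    def query(self, vector):
        # Candidates are the items that share the query's bucket.
        return list(self.buckets.get(self.key(vector), []))


table = LSHTable(dim=3)
table.insert("a", [1.0, 2.0, 3.0])
candidates = table.query([1.0, 2.0, 3.0])
```

Nearby vectors tend to fall into the same bucket, so a query inspects only a small candidate set; a practical index would use several such tables and rank the candidates by exact distance.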

arXiv:1209.3332 [pdf, other]

High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms

Authors: George Teodoro, Tony Pan, Tahsin M. Kurc, Jun Kong, Lee A. D. Cooper, Joel H. Saltz

Abstract: We propose, implement, and experimentally evaluate a runtime middleware to support high-throughput execution on hybrid cluster machines of large-scale analysis applications. A hybrid cluster machine consists of computation nodes which have multiple CPUs and general purpose graphics processing units (GPUs). Our work targets scientific analysis applications in which datasets are processed in application-specific data chunks, and the processing of a data chunk is expressed as a hierarchical pipeline of operations. The proposed middleware system combines a bag-of-tasks style execution with coarse-grain dataflow execution. Data chunks and associated data processing pipelines are scheduled across cluster nodes using a demand driven approach, while within a node operations in a given pipeline instance are scheduled across CPUs and GPUs. The runtime system implements several optimizations, including performance aware task scheduling, architecture aware process placement, data locality conscious task assignment, and data prefetching and asynchronous data copy, to maximize utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. The application and performance benefits of the runtime middleware are demonstrated using an image analysis application, which is employed in a brain cancer study, on a state-of-the-art hybrid cluster in which each node has two 6-core CPUs and three GPUs.
Our results show that implementing and scheduling application data processing as a set of fine-grain operations provide more opportunities for runtime optimizations and attain better performance than a coarser-grain, monolithic implementation. The proposed runtime system can achieve high-throughput processing of large datasets - we were able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles at about 150 tiles/second rate on 100 nodes.

Submitted 14 September, 2012; originally announced September 2012.

Comments: 12 pages, 14 figures

arXiv:1209.3314 [pdf, other]

Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

Authors: George Teodoro, Tony Pan, Tahsin Kurc, Jun Kong, Lee Cooper, Joel Saltz

Abstract: In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x with respect to single core CPU executions for morphological reconstruction and Euclidean distance transform, respectively.

Submitted 14 September, 2012; originally announced September 2012.

Comments: 37 pages, 16 figures
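
The propagation rule described in the abstract above (wavefront elements update their grid neighbors when a propagation condition holds, and updated neighbors join the wavefront) can be sketched sequentially. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single FIFO queue on one CPU core, 4-connectivity, and grayscale morphological reconstruction as the example operation. The function name `wavefront_reconstruct` and the choice to seed the queue with every pixel are hypothetical simplifications; the multi-level GPU queues and tile-based parallelization are not shown.

```python
from collections import deque


def wavefront_reconstruct(marker, mask):
    """Grayscale reconstruction of `marker` under `mask` via queue-based propagation."""
    rows, cols = len(marker), len(marker[0])
    # Clamp the marker so it never exceeds the mask.
    out = [[min(marker[r][c], mask[r][c]) for c in range(cols)] for r in range(rows)]
    # Simplification: the initial wavefront contains every pixel.
    queue = deque((r, c) for r in range(rows) for c in range(cols))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                candidate = min(out[r][c], mask[nr][nc])
                # Propagation condition: the neighbor's value can be raised.
                if candidate > out[nr][nc]:
                    out[nr][nc] = candidate
                    queue.append((nr, nc))  # the neighbor joins the wavefront
    return out


marker = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
mask = [[3, 3, 0], [3, 5, 0], [0, 0, 4]]
result = wavefront_reconstruct(marker, mask)
```

Which pixels enter the queue depends on the image content, which is the source of the irregular data accesses the abstract mentions. In this example the bottom-right pixel stays at 0 because both of its neighbors are capped at 0 by the mask.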

arXiv:1209.0410 [pdf, other]

Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms

Authors: George Teodoro, Eduardo Valle, Nathan Mariano, Ricardo Torres, Wagner Meira Jr, Joel H. Saltz

Abstract: Similarity search in high-dimensional spaces is a pivotal operation found in a variety of database applications. Recently, there has been an increased interest in similarity search for online content-based multimedia services. Those services, however, introduce new challenges with respect to the very large volumes of data that have to be indexed/searched, and the need to minimize response times observed by the end-users. Additionally, those users dynamically interact with the systems, creating fluctuating query request rates and requiring the search algorithm to adapt in order to better utilize the underlying hardware to reduce response times. In order to address these challenges, we introduce Hypercurves, a flexible framework for answering approximate k-nearest neighbor (kNN) queries for very large multimedia databases, aiming at online content-based multimedia services. Hypercurves executes on hybrid CPU-GPU environments, and is able to employ those devices cooperatively to support massive query request rates. In order to keep the response times optimal as the request rates vary, it employs a novel dynamic scheduler to partition the work between CPU and GPU. Hypercurves was thoroughly evaluated using a large database of multimedia descriptors. Its cooperative CPU-GPU execution achieved performance improvements of up to 30x when compared to the single CPU-core version. The dynamic work partition mechanism reduces the observed query response times by about 50% when compared to the best static CPU-GPU task partition configuration.
In addition, Hypercurves achieves superlinear scalability in distributed (multi-node) executions, while keeping a high guarantee of equivalence with its sequential version, thanks to the proof of probabilistic equivalence, which supported its aggressive parallelization design.

Submitted 3 September, 2012; originally announced September 2012.

Comments: 25 pages
