Search | arXiv e-print repository

arXiv:2406.10893 [pdf, other]

Development and Validation of Fully Automatic Deep Learning-Based Algorithms for Immunohistochemistry Reporting of Invasive Breast Ductal Carcinoma

Authors: Sumit Kumar Jha, Purnendu Mishra, Shubham Mathur, Gursewak Singh, Rajiv Kumar, Kiran Aatre, Suraj Rengarajan

Abstract: Immunohistochemistry (IHC) analysis is a well-accepted and widely used method for molecular subtyping, a procedure for prognosis and targeted therapy of breast carcinoma, the most common type of tumor affecting women. There are four molecular biomarkers, namely progesterone receptor (PR), estrogen receptor (ER), antigen Ki67, and human epidermal growth factor receptor 2 (HER2), whose assessment is needed under the IHC procedure to decide prognosis as well as predictors of response to therapy. However, IHC scoring is based on subjective microscopic examination of tumor morphology and suffers from poor reproducibility, high subjectivity, and often incorrect scoring in low-score cases. In this paper, we present a deep learning-based, semi-supervised trained, fully automatic decision support system (DSS) for IHC scoring of invasive ductal carcinoma. Our system automatically detects the tumor region, removing artifacts, and scores based on the Allred standard. The system is developed using 3 million pathologist-annotated image patches from 300 slides, fifty thousand in-house cell annotations, and forty thousand pixel markings of HER2 membrane. We conducted multicentric trials at four centers with three different types of digital scanners, evaluated in terms of percentage agreement with doctors, and achieved agreements of 95, 92, 88 and 82 percent for the Ki67, HER2, ER, and PR stain categories, respectively. In addition to overall accuracy, we found that in 5 percent of cases pathologists changed their score in favor of the algorithm score while reviewing with detailed algorithmic analysis. Our approach could improve the accuracy of IHC scoring and subsequent therapy decisions, particularly where specialist expertise is unavailable. Our system is highly modular. The proposed algorithm modules can be used to develop DSS for other cancer types.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2403.19477 [pdf, other]

Real-time Geoinformation Systems to Improve the Quality, Scalability, and Cost of Internet of Things for Agri-environment Research

Authors: Bryan C. Runck, Bobby Schulz, Jeff Bishop, Nathan Carlson, Bryan Chantigian, Gary Deters, Jesse Erdmann, Patrick M. Ewing, Michael Felzan, Xiao Fu, Jan Greyling, Christopher J. Hogan, Andrew Hollman, Ali Joglekar, Kris Junker, Michael Kantar, Lumbani Kaunda, Mohana Krishna, Benjamin Lynch, Peter Marchetto, Megan Marsolek, Troy McKay, Brad Morris, Ali Rashid Niaghi, Keerthi Pamulaparthy, et al. (19 additional authors not shown)

Abstract: With the increasing emphasis on machine learning and artificial intelligence to drive knowledge discovery in the agricultural sciences, spatial internet of things (IoT) technologies have become increasingly important for collecting real-time, high resolution data for these models. However, managing large fleets of devices while maintaining high data quality remains an ongoing challenge as scientists iterate from prototype to mature end-to-end applications. Here, we provide a set of case studies using the framework of technology readiness levels for an open source spatial IoT system. The spatial IoT systems underwent 3 major and 14 minor system versions, had over 2,727 devices manufactured both in academic and commercial contexts, and are either in active or planned deployment across four continents. Our results show the evolution of a generalizable, open source spatial IoT system designed for agricultural scientists, and provide a model for academic researchers to overcome the challenges that exist in going from one-off prototypes to thousands of internet-connected devices.

Submitted 2 April, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: 20 pages, 5 figures, 1 table

arXiv:2310.12080 [pdf]

Aperiodic MEG abnormality in patients with focal to bilateral tonic-clonic seizures

Authors: Kirandeep Kaur, Jonathan J Horsley, Csaba Kozma, Gerard R Hall, Thomas W Owen, Yujiang Wang, Guarav Singh, Sarat P Chandra, Manjari Tripathi, Peter N Taylor

Abstract: Aperiodic activity is a physiologically distinct component of the electrophysiological power spectrum. It is suggested to reflect the balance of excitation and inhibition in the brain, within selected frequency bands. However, the impact of recurrent seizures on aperiodic activity remains unknown, particularly in patients with severe bilateral seizures. Here, we hypothesised greater aperiodic abnormality in the epileptogenic zone, in patients with focal to bilateral tonic clonic (FBTC) seizures, and earlier age of seizure onset. Pre-operative magnetoencephalography (MEG) recordings were acquired from 36 patients who achieved complete seizure freedom (Engel I outcome) post-surgical resection. A normative whole brain map of the aperiodic exponent was computed by averaging across subjects for each region in the hemisphere contralateral to the side of resection. Selected regions of interest were then tested for abnormality using deviations from the normative map in terms of z-scores. Resection masks drawn from postoperative structural imaging were used as an approximation of the epileptogenic zone. Patients with FBTC seizures had greater abnormality compared to patients with focal onset seizures alone in the resection volume (p=0.003, area under the ROC curve = 0.78). Earlier age of seizure onset was correlated with greater abnormality of the aperiodic exponent in the resection volume (correlation coefficient = -0.3, p=0.04) as well as the whole cortex (rho = -0.33, p=0.03). The abnormality of the aperiodic exponent did not significantly differ between the resected and non-resected regions of the brain. Abnormalities in aperiodic components relate to important clinical characteristics such as severity and age of seizure onset. This suggests the potential use of the aperiodic band power component as a marker for severity of epilepsy.
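Illustration (not from the paper): the normative-map and z-score step described in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions: the region names, the sample values, the use of the sample standard deviation, and the mean absolute z-score summary are all hypothetical choices made for the example, not the authors' pipeline.

```python
from statistics import mean, stdev

def normative_map(values_by_region):
    """Build a per-region (mean, std) map from values pooled across subjects.

    values_by_region: dict mapping a region name to a list of aperiodic
    exponents, one per subject (hypothetical input layout).
    """
    return {
        region: (mean(vals), stdev(vals))
        for region, vals in values_by_region.items()
    }

def region_z_scores(patient_values, norm):
    """Express one patient's regional values as z-scores against the map."""
    return {
        region: (value - norm[region][0]) / norm[region][1]
        for region, value in patient_values.items()
        if region in norm
    }

def abnormality(z_scores, regions):
    """Summarise abnormality over a set of regions (assumed: mean |z|)."""
    return mean(abs(z_scores[r]) for r in regions)

# Hypothetical example data: two regions, three subjects each.
norm = normative_map({
    "region_a": [1.0, 2.0, 3.0],
    "region_b": [0.5, 1.5, 2.5],
})
z = region_z_scores({"region_a": 4.0, "region_b": 1.5}, norm)
```

Here `abnormality(z, ["region_a", "region_b"])` would summarise a hypothetical resection mask covering both regions.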

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.04366 [pdf, other]

Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors

Authors: Taha Shahroodi, Gagandeep Singh, Mahdi Zahedi, Haiyu Mao, Joel Lindegger, Can Firtina, Stephan Wong, Onur Mutlu, Said Hamdioui

Abstract: Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately, these DNNs are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNNs. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations. This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast), open-source basecaller as a case study. Our experimental results using Swordfish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7x, with an accuracy loss of 6.01%.

Submitted 26 November, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023

arXiv:2309.17063 [pdf, other]

GateSeeder: Near-memory CPU-FPGA Acceleration of Short and Long Read Mapping

Authors: Julien Eudine, Mohammed Alser, Gagandeep Singh, Can Alkan, Onur Mutlu

Abstract: Motivation: Read mapping is a computationally expensive process and a major bottleneck in genomics analyses. The performance of read mapping is mainly limited by the performance of three key computational steps: Index Querying, Seed Chaining, and Sequence Alignment. The first step is dominated by how fast and frequently it accesses the main memory (i.e., memory-bound), while the latter two steps are dominated by how fast the CPU can compute their computationally-costly dynamic programming algorithms (i.e., compute-bound). Accelerating these three steps by exploiting new algorithms and new hardware devices is essential to accelerate most genome analysis pipelines that widely use read mapping. Given the large body of work on accelerating Sequence Alignment, this work focuses on significantly improving the remaining steps. Results: We introduce GateSeeder, the first CPU-FPGA-based near-memory acceleration of both short and long read mapping. GateSeeder exploits near-memory computation capability provided by modern FPGAs that couple a reconfigurable compute fabric with high-bandwidth memory (HBM) to overcome the memory-bound and compute-bound bottlenecks. GateSeeder also introduces a new lightweight algorithm for finding the potential matching segment pairs. Using real ONT, HiFi, and Illumina sequences, we experimentally demonstrate that GateSeeder outperforms Minimap2, without performing sequence alignment, by up to 40.3x, 4.8x, and 2.3x, respectively. When performing read mapping with sequence alignment, GateSeeder outperforms Minimap2 by 1.15-4.33x (using KSW2) and by 1.97-13.63x (using WFA-GPU). Availability: https://github.com/CMU-SAFARI/GateSeeder

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2306.00838 [pdf, other]

The Brain Tumor Segmentation (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI

Authors: Ahmed W. Moawad, Anastasia Janas, Ujjwal Baid, Divya Ramakrishnan, Rachit Saluja, Nader Ashraf, Leon Jekel, Raisa Amiruddin, Maruf Adewole, Jake Albrecht, Udunna Anazodo, Sanjay Aneja, Syed Muhammad Anwar, Timothy Bergquist, Evan Calabrese, Veronica Chiang, Verena Chung, Gian Marco Marco Conte, Farouk Dako, James Eddy, Ivan Ezhov, Ariana Familiar, Keyvan Farahani, Juan Eugenio Iglesias, Zhifan Jiang, et al. (206 additional authors not shown)

Abstract: The translation of AI-generated brain metastases (BM) segmentation into clinical practice relies heavily on diverse, high-quality annotated medical imaging datasets. The BraTS-METS 2023 challenge has gained momentum for testing and benchmarking algorithms using rigorously annotated internationally compiled real-world datasets. This study presents the results of the segmentation challenge and characterizes the challenging cases that impacted the performance of the winning algorithms. Untreated brain metastases on standard anatomic MRI sequences (T1, T2, FLAIR, T1PG) from eight contributed international datasets were annotated in a stepwise method: published UNET algorithms, student, neuroradiologist, final approver neuroradiologist. Segmentations were ranked based on lesion-wise Dice and Hausdorff distance (HD95) scores. False positives (FP) and false negatives (FN) were rigorously penalized, receiving a score of 0 for Dice and a fixed penalty of 374 for HD95. Eight datasets comprising 1303 studies were annotated, with 402 studies (3076 lesions) released on Synapse as publicly available datasets to challenge competitors. Additionally, 31 studies (139 lesions) were held out for validation, and 59 studies (218 lesions) were used for testing. Segmentation accuracy was measured as rank across subjects, with the winning team achieving a LesionWise mean score of 7.9. Common errors among the leading teams included false negatives for small lesions and misregistration of masks in space. The BraTS-METS 2023 challenge successfully curated well-annotated, diverse datasets and identified common errors, facilitating the translation of BM segmentation across varied clinical environments and providing personalized volumetric reports to patients undergoing BM treatment.

Submitted 17 June, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

arXiv:2301.09200 [pdf, other]

doi 10.1093/bioinformatics/btad272

RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes

Authors: Can Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, Onur Mutlu

Abstract: Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.

Submitted 1 June, 2023; v1 submitted 22 January, 2023; originally announced January 2023.

Comments: To appear in proceedings of ISMB/ECCB 2023

arXiv:2212.04953 [pdf, other]

TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

Authors: Meryem Banu Cavlak, Gagandeep Singh, Mohammed Alser, Can Firtina, Joël Lindegger, Mohammad Sadrosadati, Nika Mansouri Ghiasi, Can Alkan, Onur Mutlu

Abstract: Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally-inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. TargetCall aims to filter out all off-target reads before basecalling. The highly-accurate but slow basecalling is performed only on the raw signals whose noisy reads are labeled as on-target. Our thorough experimental evaluations using both real and simulated data show that TargetCall 1) improves the end-to-end basecalling performance while maintaining high sensitivity in keeping on-target reads, 2) maintains high accuracy in downstream analysis, 3) precisely filters out up to 94.71% of off-target reads, and 4) achieves better performance, throughput, sensitivity, precision, and generality compared to prior works. We open-source TargetCall at https://github.com/CMU-SAFARI/TargetCall

Submitted 14 September, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

arXiv:2211.03079 [pdf, other]

RUBICON: A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers

Authors: Gagandeep Singh, Mohammed Alser, Kristof Denolf, Can Firtina, Alireza Khodamoradi, Meryem Banu Cavlak, Henk Corporaal, Onur Mutlu

Abstract: Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96x speedup with 2.97% higher accuracy. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models.

Submitted 5 February, 2024; v1 submitted 6 November, 2022; originally announced November 2022.

arXiv:2205.07957 [pdf]

Going From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis

Authors: Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu

Abstract: We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent. The analysis script and data used in our experimental evaluation are available at: https://github.com/CMU-SAFARI/Molecules2Variations

Submitted 16 May, 2022; originally announced May 2022.

Comments: arXiv admin note: text overlap with arXiv:2008.00961

arXiv:2205.05883 [pdf, other]

doi 10.1145/3470496.3527436

SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

Authors: Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zülal Bingöl, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gómez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu

Abstract: A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph and sequence-to-sequence mapping pipelines.

Submitted 31 May, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: To appear in ISCA'22

arXiv:2112.08687 [pdf, other]

doi 10.1093/nargab/lqad004

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Authors: Can Firtina, Jisung Park, Mohammed Alser, Jeremie S. Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, Onur Mutlu

Abstract: Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either 1) increasing the use of the costly sequence alignment or 2) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND 1) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and 2) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x - 83.9x (on average 19.3x), has a lower memory footprint by 0.9x - 14.1x (on average 3.8x), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x - 4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
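Illustration (not from the paper): the abstract names SimHash as the technique that lets similar sets share a hash value. The sketch below is a generic, textbook-style SimHash over a set of string items, not BLEND's implementation. The 32-bit width, the SHA-256 item hash, and the k-mer strings used as set items are assumptions chosen only for the example.

```python
import hashlib

def item_hash(item: str, bits: int = 32) -> int:
    """Hash one set item to a `bits`-wide integer (SHA-256 is an assumption)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)

def simhash(items, bits: int = 32) -> int:
    """Combine item hashes by a per-bit majority vote.

    Each item votes +1 on every bit set in its hash and -1 on every
    unset bit; the output bit is 1 where the vote total is positive.
    """
    counts = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, count in enumerate(counts):
        if count > 0:
            value |= 1 << i
    return value

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")

# Hypothetical seeds represented as sets of k-mers; they differ in one item.
seed_a = ["ACGT", "CGTA", "GTAC", "TACG"]
seed_b = ["ACGT", "CGTA", "GTAC", "TACC"]
distance = hamming(simhash(seed_a), simhash(seed_b))
```

Because each output bit is a majority vote, changing one item out of many can flip only the bits where the vote was close, so sets that share most items tend to produce hashes that are close in Hamming distance and may coincide exactly. How BLEND builds its sets from seeds and sizes its hash is described in the paper and repository, not here.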

Submitted 23 May, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: Published in NARGAB

Journal ref: NAR Genomics and Bioinformatics, vol. 5, no. 1, p. lqad004, Mar. 2023

arXiv:2007.08028 [pdf]

Predicting Clinical Outcomes in COVID-19 using Radiomics and Deep Learning on Chest Radiographs: A Multi-Institutional Study

Authors: Joseph Bae, Saarthak Kapse, Gagandeep Singh, Rishabh Gattu, Syed Ali, Neal Shah, Colin Marshall, Jonathan Pierce, Tej Phatak, Amit Gupta, Jeremy Green, Nikhil Madan, Prateek Prasanna

Abstract: We predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. DL and machine learning classifiers to predict mechanical ventilation requirement and mortality were trained and evaluated using patient CXRs. A novel radiomic embedding framework was also explored for outcome prediction. All results are compared against radiologist grading of CXRs (zone-wise expert severity scores). Radiomic and DL classification models had mAUCs of 0.78+/-0.02 and 0.81+/-0.04, compared with expert scores mAUCs of 0.75+/-0.02 and 0.79+/-0.05 for mechanical ventilation requirement and mortality prediction, respectively. Combined classifiers using both radiomics and expert severity scores resulted in mAUCs of 0.79+/-0.04 and 0.83+/-0.04 for each prediction task, demonstrating improvement over either artificial intelligence or radiologist interpretation alone. Our results also suggest instances where inclusion of radiomic features in DL improves model predictions, something that might be explored in other pathologies. The models proposed in this study and the prognostic information they provide might aid physician decision making and resource allocation during the COVID-19 pandemic.

Submitted 1 July, 2021; v1 submitted 15 July, 2020; originally announced July 2020.

Comments: Joseph Bae and Saarthak Kapse have contributed equally to this work

ACM Class: J.3; I.2.6

arXiv:2006.09483 [pdf]

Can tumor location on pre-treatment MRI predict likelihood of pseudo-progression versus tumor recurrence in Glioblastoma? A feasibility study

Authors: Marwa Ismail, Virginia Hill, Volodymyr Statsevych, Evan Mason, Ramon Correa, Prateek Prasanna, Gagandeep Singh, Kaustav Bera, Rajat Thawani, Anant Madabhushi, Manmeet Ahluwalia, Pallavi Tiwari

Abstract: A significant challenge in Glioblastoma (GBM) management is identifying pseudo-progression (PsP), a benign radiation-induced effect, from tumor recurrence, on routine imaging following conventional treatment. Previous studies have linked tumor lobar presence and laterality to GBM outcomes, suggesting that disease etiology and progression in GBM may be impacted by tumor location. Hence, in this feasibility study, we seek to investigate the following question: Can tumor location on treatment-naïve MRI provide early cues regarding likelihood of a patient developing pseudo-progression versus tumor recurrence? In this study, 74 pre-treatment Glioblastoma MRI scans with PsP (33) and tumor recurrence (41) were analyzed. First, enhancing lesion on Gd-T1w MRI and peri-lesional hyperintensities on T2w/FLAIR were segmented by experts and then registered to a brain atlas. Using patients from the two phenotypes, we construct two atlases by quantifying frequency of occurrence of enhancing lesion and peri-lesion hyperintensities, by averaging voxel intensities across the population. Analysis of differential involvement was then performed to compute voxel-wise significant differences (p-value<0.05) across the atlases. Statistically significant clusters were finally mapped to a structural atlas to provide anatomic localization of their location.
Our results demonstrate that patients with tumor recurrence showed prominence of their initial tumor in the parietal lobe, while patients with PsP showed a multi-focal distribution of the initial tumor in the frontal and temporal lobes, insula, and putamen. These preliminary results suggest that lateralization of pre-treatment lesions towards certain anatomical areas of the brain may allow to provide early cues regarding assessing likelihood of occurrence of pseudo-progression from tumor recurrence on MRI scans.

Submitted 16 June, 2020; originally announced June 2020.

arXiv:1906.08669 [pdf]

Petri-net modeling of B-cell receptor signaling pathways: A case study in CLL

Authors: Gajendra Pratap Singh, Madhuri Jha

Abstract: Immunology is an emerging research area that deals with the study of the immune system in living organisms. It is modelled through various computational and mathematical models to address the problems faced when boosting an organism's immune system or fighting an infectious disease at its earliest stage. Such models are very important for a better understanding of the complex behaviour of pathways inside the cells. The signalling pathways between the cells are complex and difficult to visualize in the immune system of human beings. So, it's important to study the function of these cells separately. T-cells and B-cells are an important part of the immune system, and each has its own receptors and signalling pathways for dealing with antigens. In this paper, we discuss the B-cell receptor (BCR) and the different signalling pathways downstream of it. We designed a Petri-net model of the process by which B-cells gather antigens independently of T-cells, and of the effect of that process on the immune system of the organism. We will also discuss the contribution of the BCR to the selection of the precursor tumour cell in CLL.

Submitted 20 June, 2019; originally announced June 2019.

Comments: 7 pages, 6 figures, unpublished work

MSC Class: 68Rxx ACM Class: F.2.2

arXiv:1807.00789 [pdf]

The exon junction complex undergoes a compositional switch that alters mRNP structure and nonsense-mediated mRNA decay activity

Authors: Justin W. Mabin, Lauren A. Woodward, Robert Patton, Zhongxia Yi, Mengxuan Jia, Vicki Wysocki, Ralf Bundschuh, Guramrit Singh

Abstract: The exon junction complex (EJC) deposited upstream of mRNA exon junctions shapes structure, composition and fate of spliced mRNA ribonucleoprotein particles (mRNPs). To achieve this, the EJC core nucleates assembly of a dynamic shell of peripheral proteins that function in diverse post-transcriptional processes. To illuminate consequences of EJC composition change, we purified EJCs from human cells via peripheral proteins RNPS1 and CASC3. We show that EJC originates as an SR-rich mega-dalton sized RNP that contains RNPS1 but lacks CASC3. After mRNP export to the cytoplasm and before translation, the EJC undergoes a remarkable compositional and structural remodeling into an SR-devoid monomeric complex that contains CASC3. Surprisingly, RNPS1 is important for nonsense-mediated mRNA decay (NMD) in general whereas CASC3 is needed for NMD of only select mRNAs. The promotion of switch to CASC3-EJC slows down NMD. Overall, the EJC compositional switch dramatically alters mRNP structure and specifies two distinct phases of EJC-dependent NMD.

Submitted 2 July, 2018; originally announced July 2018.

arXiv:1404.4405 [pdf, other]

doi 10.1063/1.4897978

Arginine-Phosphate Salt Bridges Between Histones and DNA: Intermolecular Actuators that Control Nucleosome Architecture

Authors: Tahir I. Yusufaly, Yun Li, Gautam Singh, Wilma K. Olson

Abstract: Structural bioinformatics and van der Waals density functional theory are combined to investigate the mechanochemical impact of a major class of histone-DNA interactions, namely the formation of salt bridges between arginine residues in histones and phosphate groups on the DNA backbone. Principal component analysis reveals that the configurational fluctuations of the sugar-phosphate backbone display sequence-specific variability, and clustering of nucleosomal crystal structures identifies two major salt bridge configurations: a monodentate form in which the arginine end-group guanidinium only forms one hydrogen bond with the phosphate, and a bidentate form in which it forms two. Density functional theory calculations highlight that the combination of sequence, denticity and salt bridge positioning enable the histones to tunably activate specific backbone deformations via mechanochemical stress. The results suggest that selection for specific placements of van der Waals contacts, with high-precision control of the spatial distribution of intermolecular forces, may serve as an underlying evolutionary design principle for the structure and function of nucleosomes, a conjecture that is corroborated by previous experimental studies.

Submitted 30 September, 2014; v1 submitted 16 April, 2014; originally announced April 2014.

Comments: Revised version - Accepted for publication in J. Chem. Phys

arXiv:0812.3426 [pdf, ps, other]

doi 10.1063/1.3103496

Topological Methods for Exploring Low-density States in Biomolecular Folding Pathways

Authors: Yuan Yao, Jian Sun, Xuhui Huang, Gregory R. Bowman, Gurjeet Singh, Michael Lesnick

Abstract: Characterization of transient intermediate or transition states is crucial for the description of biomolecular folding pathways, which is however difficult in both experiments and computer simulations. Such transient states are typically of low population in simulation samples. Even for simple systems such as RNA hairpins, recently there are mounting debates over the existence of multiple intermediate states. In this paper, we develop a computational approach to explore the relatively low populated transition or intermediate states in biomolecular folding pathways, based on a topological data analysis tool, Mapper, with simulation data from large-scale distributed computing. The method is inspired by the classical Morse theory in mathematics which characterizes the topology of high dimensional shapes via some functional level sets. In this paper we exploit a conditional density filter which enables us to focus on the structures on pathways, followed by clustering analysis on its level sets, which helps separate low populated intermediates from high populated uninteresting structures. A successful application of this method is given on a motivating example, a RNA hairpin with GCAA tetraloop, where we are able to provide structural evidence from computer simulations on the multiple intermediate states and exhibit different pictures about unfolding and refolding pathways.
The method is effective in dealing with high degree of heterogeneity in distribution, capturing structural features in multiple pathways, and being less sensitive to the distance metric than nonlinear dimensionality reduction or geometric embedding methods. It provides us a systematic tool to explore the low density intermediate states in complex biomolecular folding systems.

Submitted 17 December, 2008; originally announced December 2008.

Comments: 23 pages, 6 figures

Showing 1–18 of 18 results for author: Singh, G