Search | arXiv e-print repository

Creating and Querying Data Cubes in Python using pyCube

Authors: Sigmundur Vang, Christian Thomsen, Torben Bach Pedersen

Abstract: Data cubes are used for analyzing large data sets usually contained in data warehouses. The most popular data cube tools use graphical user interfaces (GUI) to do the data analysis. Traditionally this was fine since data analysts were not expected to be technical people. However, in the subsequent decades the data landscape changed dramatically requiring companies to employ large teams of highly t… ▽ More Data cubes are used for analyzing large data sets usually contained in data warehouses. The most popular data cube tools use graphical user interfaces (GUI) to do the data analysis. Traditionally this was fine since data analysts were not expected to be technical people. However, in the subsequent decades the data landscape changed dramatically requiring companies to employ large teams of highly technical data scientists in order to manage and use the ever increasing amount of data. These data scientists generally use tools like Python, interactive notebooks, pandas, etc. while modern data cube tools are still GUI based. This paper proposes a Python-based data cube tool called pyCube. pyCube is able to semi-automatically create data cubes for data stored in an RDBMS and manages the data cube metadata. pyCube's programmatic interface enables data scientist to query data cubes by specifying the expected metadata of the result. pyCube is experimentally evaluated on Star Schema Benchmark (SSB). The results show that pyCube vastly outperforms different implementations of SSB queries in pandas in both runtime and memory while being easier to read and write. △ Less

Submitted 28 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Extended version of DOLAP2024 submission

arXiv:1910.07640 [pdf, ps, other]

doi 10.1007/978-3-030-31901-4_1

A Combined Deep Learning-Gradient Boosting Machine Framework for Fluid Intelligence Prediction

Authors: Yeeleng S. Vang, Yingxin Cao, Xiaohui Xie

Abstract: The ABCD Neurocognitive Prediction Challenge is a community driven competition asking competitors to develop algorithms to predict fluid intelligence score from T1-w MRIs. In this work, we propose a deep learning combined with gradient boosting machine framework to solve this task. We train a convolutional neural network to compress the high dimensional MRI data and learn meaningful image features… ▽ More The ABCD Neurocognitive Prediction Challenge is a community driven competition asking competitors to develop algorithms to predict fluid intelligence score from T1-w MRIs. In this work, we propose a deep learning combined with gradient boosting machine framework to solve this task. We train a convolutional neural network to compress the high dimensional MRI data and learn meaningful image features by predicting the 123 continuous-valued derived data provided with each MRI. These extracted features are then used to train a gradient boosting machine that predicts the residualized fluid intelligence score. Our approach achieved mean square error (MSE) scores of 18.4374, 68.7868, and 96.1806 for the training, validation, and test set respectively. △ Less

Submitted 16 October, 2019; originally announced October 2019.

Comments: Challenge in Adolescent Brain Cognitive Development Neurocognitive Prediction

Journal ref: In: Pohl K., Thompson W., Adeli E., Linguraru M. (eds) Adolescent Brain Cognitive Development Neurocognitive Prediction. ABCD-NP 2019. Lecture Notes in Computer Science, vol 11791. Springer, Cham (2019)

arXiv:1805.05373 [pdf, other]

DeepEM: Deep 3D ConvNets With EM For Weakly Supervised Pulmonary Nodule Detection

Authors: Wentao Zhu, Yeeleng S. Vang, Yufang Huang, Xiaohui Xie

Abstract: Recently deep learning has been witnessing widespread adoption in various medical image applications. However, training complex deep neural nets requires large-scale datasets labeled with ground truth, which are often unavailable in many medical image domains. For instance, to train a deep neural net to detect pulmonary nodules in lung computed tomography (CT) images, current practice is to manual… ▽ More Recently deep learning has been witnessing widespread adoption in various medical image applications. However, training complex deep neural nets requires large-scale datasets labeled with ground truth, which are often unavailable in many medical image domains. For instance, to train a deep neural net to detect pulmonary nodules in lung computed tomography (CT) images, current practice is to manually label nodule locations and sizes in many CT images to construct a sufficiently large training dataset, which is costly and difficult to scale. On the other hand, electronic medical records (EMR) contain plenty of partial information on the content of each medical image. In this work, we explore how to tap this vast, but currently unexplored data source to improve pulmonary nodule detection. We propose DeepEM, a novel deep 3D ConvNet framework augmented with expectation-maximization (EM), to mine weakly supervised labels in EMRs for pulmonary nodule detection. Experimental results show that DeepEM can lead to 1.5\% and 3.9\% average improvement in free-response receiver operating characteristic (FROC) scores on LUNA16 and Tianchi datasets, respectively, demonstrating the utility of incomplete information in EMRs for improving deep learning algorithms.\footnote{https://github.com/uci-cbcl/DeepEM-for-Weakly-Supervised-Detection.git} △ Less

Submitted 26 May, 2018; v1 submitted 14 May, 2018; originally announced May 2018.

Comments: MICCAI2018 Early Accept, Code https://github.com/uci-cbcl/DeepEM-for-Weakly-Supervised-Detection.git

arXiv:1802.00931 [pdf, other]

Deep Learning Framework for Multi-class Breast Cancer Histology Image Classification

Authors: Yeeleng S. Vang, Zhen Chen, Xiaohui Xie

Abstract: In this work, we present a deep learning framework for multi-class breast cancer image classification as our submission to the International Conference on Image Analysis and Recognition (ICIAR) 2018 Grand Challenge on BreAst Cancer Histology images (BACH). As these histology images are too large to fit into GPU memory, we first propose using Inception V3 to perform patch level classification. The… ▽ More In this work, we present a deep learning framework for multi-class breast cancer image classification as our submission to the International Conference on Image Analysis and Recognition (ICIAR) 2018 Grand Challenge on BreAst Cancer Histology images (BACH). As these histology images are too large to fit into GPU memory, we first propose using Inception V3 to perform patch level classification. The patch level predictions are then passed through an ensemble fusion framework involving majority voting, gradient boosting machine (GBM), and logistic regression to obtain the image level prediction. We improve the sensitivity of the Normal and Benign predicted classes by designing a Dual Path Network (DPN) to be used as a feature extractor where these extracted features are further sent to a second layer of ensemble prediction fusion using GBM, logistic regression, and support vector machine (SVM) to refine predictions. Experimental results demonstrate our framework shows a 12.5$\%$ improvement over the state-of-the-art model. △ Less

Submitted 3 February, 2018; originally announced February 2018.

Comments: 8 pages, 2 figures, 3 tables, ICIAR2018 workshop

arXiv:1705.08550 [pdf, other]

Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification

Authors: Wentao Zhu, Qi Lou, Yeeleng Scott Vang, Xiaohui Xie

Abstract: Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods rely on regions of interest (ROIs) which require great efforts to annotate. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning (MIL) for labeling a set of instances/patches, we propose end-to-end trained deep multi-instance… ▽ More Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods rely on regions of interest (ROIs) which require great efforts to annotate. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning (MIL) for labeling a set of instances/patches, we propose end-to-end trained deep multi-instance networks for mass classification based on whole mammogram without the aforementioned ROIs. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of proposed networks compared to previous work using segmentation and detection annotations. △ Less

Submitted 23 May, 2017; originally announced May 2017.

Comments: MICCAI 2017 Camera Ready

arXiv:1701.00593 [pdf, other]

HLA class I binding prediction via convolutional neural networks

Authors: Yeeleng Scott Vang, Xiaohui Xie

Abstract: Many biological processes are governed by protein-ligand interactions. One such example is the recognition of self and nonself cells by the immune system. This immune response process is regulated by the major histocompatibility complex (MHC) protein which is encoded by the human leukocyte antigen (HLA) complex. Understanding the binding potential between MHC and peptides can lead to the design of… ▽ More Many biological processes are governed by protein-ligand interactions. One such example is the recognition of self and nonself cells by the immune system. This immune response process is regulated by the major histocompatibility complex (MHC) protein which is encoded by the human leukocyte antigen (HLA) complex. Understanding the binding potential between MHC and peptides can lead to the design of more potent, peptide-based vaccines and immunotherapies for infectious autoimmune diseases. We apply machine learning techniques from the natural language processing (NLP) domain to address the task of MHC-peptide binding prediction. More specifically, we introduce a new distributed representation of amino acids, name HLA-Vec, that can be used for a variety of downstream proteomic machine learning tasks. We then propose a deep convolutional neural network architecture, name HLA-CNN, for the task of HLA class I-peptide binding prediction. Experimental results show combining the new distributed representation with our HLA-CNN architecture achieves state-of-the-art results in the majority of the latest two Immune Epitope Database (IEDB) weekly automated benchmark datasets. We further apply our model to predict binding on the human genome and identify 15 genes with potential for self binding. △ Less

Submitted 12 April, 2017; v1 submitted 3 January, 2017; originally announced January 2017.

Comments: 12 pages, 4 figures, 4 tables

arXiv:1612.05968 [pdf, other]

Deep Multi-instance Networks with Sparse Label Assignment for Whole Mammogram Classification

Authors: Wentao Zhu, Qi Lou, Yeeleng Scott Vang, Xiaohui Xie

Abstract: Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods requires great effort to annotate the training data by costly manual labeling and specialized computational models to detect these annotations during test. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning for labeling a se… ▽ More Mammogram classification is directly related to computer-aided diagnosis of breast cancer. Traditional methods requires great effort to annotate the training data by costly manual labeling and specialized computational models to detect these annotations during test. Inspired by the success of using deep convolutional features for natural image analysis and multi-instance learning for labeling a set of instances/patches, we propose end-to-end trained deep multi-instance networks for mass classification based on whole mammogram without the aforementioned costly need to annotate the training data. We explore three different schemes to construct deep multi-instance networks for whole mammogram classification. Experimental results on the INbreast dataset demonstrate the robustness of proposed deep networks compared to previous work using segmentation and detection annotations in the training. △ Less

Submitted 18 December, 2016; originally announced December 2016.

Showing 1–7 of 7 results for author: Vang, S