-
DSAM: A Deep Learning Framework for Analyzing Temporal and Spatial Dynamics in Brain Networks
Authors:
Bishal Thapaliya,
Robyn Miller,
Jiayu Chen,
Yu-Ping Wang,
Esra Akbas,
Ram Sapkota,
Bhaskar Ray,
Pranav Suresh,
Santosh Ghimire,
Vince Calhoun,
Jingyu Liu
Abstract:
Resting-state functional magnetic resonance imaging (rs-fMRI) is a noninvasive technique pivotal for understanding human neural mechanisms of intricate cognitive processes. Most rs-fMRI studies compute a single static functional connectivity matrix across brain regions of interest, or dynamic functional connectivity matrices with a sliding window approach. These approaches are at risk of oversimplifying brain dynamics and lack proper consideration of the goal at hand. While deep learning has gained substantial popularity for modeling complex relational data, its application to uncovering the spatiotemporal dynamics of the brain is still limited. We propose a novel interpretable deep learning framework that learns a goal-specific functional connectivity matrix directly from time series and employs a specialized graph neural network for the final classification. Our model, DSAM, leverages temporal causal convolutional networks to capture the temporal dynamics in both low- and high-level feature representations, a temporal attention unit to identify important time points, a self-attention unit to construct the goal-specific connectivity matrix, and a novel variant of graph neural network to capture the spatial dynamics for downstream classification. To validate our approach, we conducted experiments on the Human Connectome Project dataset with 1075 samples to build and interpret the model for the classification of sex group, and the Adolescent Brain Cognitive Development Dataset with 8520 samples for independent testing. Comparing our proposed framework with other state-of-the-art models, the results suggest that this novel approach goes beyond the assumption of a fixed connectivity matrix and provides evidence of goal-specific brain connectivity patterns, which opens up the potential to gain deeper insights into how the human brain adapts its functional connectivity specific to the task at hand.
Submitted 19 May, 2024;
originally announced May 2024.
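The two baselines that the DSAM abstract contrasts itself with, a single static connectivity matrix and sliding-window dynamic connectivity, can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not the authors' pipeline; the function name, window length, and step size are assumptions chosen for the example.

```python
import numpy as np

def sliding_window_connectivity(ts, window, step):
    """Pearson correlation matrices over overlapping windows.

    ts: array of shape (time, regions). Returns (n_windows, regions, regions).
    """
    n_time = ts.shape[0]
    starts = range(0, n_time - window + 1, step)
    # np.corrcoef expects variables in rows, hence the transpose.
    return np.stack([np.corrcoef(ts[s:s + window].T) for s in starts])

# Synthetic stand-in for one subject's regional time series (hypothetical sizes).
rng = np.random.default_rng(0)
ts = rng.standard_normal((100, 5))

static_fc = np.corrcoef(ts.T)  # one matrix for the whole scan
dynamic_fc = sliding_window_connectivity(ts, window=30, step=10)
```

Both outputs are fixed by the choice of correlation and window, independent of any prediction target, which is the limitation the abstract points to.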
-
Refinement of an Epilepsy Dictionary through Human Annotation of Health-related posts on Instagram
Authors:
Aehong Min,
Xuan Wang,
Rion Brattig Correia,
Jordan Rozum,
Wendy R. Miller,
Luis M. Rocha
Abstract:
We used a dictionary built from biomedical terminology extracted from various sources such as DrugBank, MedDRA, MedlinePlus, TCMGeneDIT, to tag more than 8 million Instagram posts by users who have mentioned an epilepsy-relevant drug at least once, between 2010 and early 2016. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false-positives. OpenAI's GPT series models were compared against human annotation. Frequent terms with a high false-positive rate were removed from the dictionary. Analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) used in Instagram posts, which were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary thus produced leads to a significantly different rank of important terms, as measured by their eigenvector-centrality of the knowledge networks. Furthermore, the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators in this task.
Submitted 14 May, 2024;
originally announced May 2024.
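The eigenvector-centrality ranking mentioned in the abstract can be illustrated with a short power iteration. This is a sketch on a hypothetical four-term co-occurrence network, not the authors' knowledge network; the adjacency matrix, function name, and iteration settings are assumptions. The identity shift (iterating with A + I) is a common device to avoid oscillation on bipartite graphs and does not change the eigenvectors.

```python
import numpy as np

def eigenvector_centrality(adj, iters=500, tol=1e-12):
    """Leading eigenvector of a symmetric, non-negative adjacency matrix."""
    n = adj.shape[0]
    shifted = adj + np.eye(n)  # same eigenvectors as adj, better convergence
    x = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        nxt = shifted @ x
        nxt /= np.linalg.norm(nxt)
        if np.linalg.norm(nxt - x) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical network: term 0 links to all others, terms 1 and 2 also link
# to each other, term 3 links only to term 0.
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

centrality = eigenvector_centrality(adj)
ranking = np.argsort(-centrality)  # most central term first
```

Removing a term amounts to deleting its row and column before recomputing, which is how a refined dictionary can change the ranking of the remaining terms.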
-
Gen-T: Table Reclamation in Data Lakes
Authors:
Grace Fan,
Roee Shraga,
Renée J. Miller
Abstract:
We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only SELECT-PROJECT-JOIN queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.
Submitted 22 March, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Model Lakes
Authors:
Koyena Pal,
David Bau,
Renée J. Miller
Abstract:
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models differ from one another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of machine learning models increases, this issue of finding, differentiating, and understanding models is becoming more crucial. Inspired by research on data lakes, we introduce and define the concept of model lakes. We discuss fundamental research challenges in the management of large models. We also discuss what principled data management techniques can be brought to bear on the study of large model management.
Submitted 4 March, 2024;
originally announced March 2024.
-
Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse
Authors:
Bradley T. Baker,
Barak A. Pearlmutter,
Robyn Miller,
Vince D. Calhoun,
Sergey M. Plis
Abstract:
Our understanding of learning dynamics of deep neural networks (DNNs) remains incomplete. Recent research has begun to uncover the mathematical principles underlying these networks, including the phenomenon of "Neural Collapse", where linear classifiers within DNNs converge to specific geometrical structures during late-stage training. However, the role of geometric constraints in learning extends beyond this terminal phase. For instance, gradients in fully-connected layers naturally develop a low-rank structure due to the accumulation of rank-one outer products over a training batch. Despite the attention given to methods that exploit this structure for memory saving or regularization, the emergence of low-rank learning as an inherent aspect of certain DNN architectures has been under-explored. In this paper, we conduct a comprehensive study of gradient rank in DNNs, examining how architectural choices and structure of the data affect gradient rank bounds. Our theoretical analysis provides these bounds for training fully-connected, recurrent, and convolutional neural networks. We also demonstrate, both theoretically and empirically, how design choices like activation function linearity, bottleneck layer introduction, convolutional stride, and sequence truncation influence these bounds. Our findings not only contribute to the understanding of learning dynamics in DNNs, but also provide practical guidance for deep learning engineers to make informed design decisions.
Submitted 9 February, 2024;
originally announced February 2024.
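The remark that fully-connected gradients accumulate rank-one outer products over a batch can be checked numerically. The sketch below is an illustration under assumed layer sizes and random data, not code from the paper: for a layer y = W x, the weight gradient is the sum over samples of (backpropagated error) times (input) transposed, so its rank cannot exceed the batch size.

```python
import numpy as np

def batch_weight_gradient(deltas, inputs):
    """Weight gradient of a fully-connected layer as a sum of outer products.

    deltas: (batch, out_dim) backpropagated errors.
    inputs: (batch, in_dim) layer inputs.
    Returns an (out_dim, in_dim) matrix equal to sum_i outer(deltas[i], inputs[i]).
    """
    return deltas.T @ inputs

# Hypothetical sizes: the batch is smaller than both layer dimensions.
rng = np.random.default_rng(0)
batch, in_dim, out_dim = 4, 32, 16
inputs = rng.standard_normal((batch, in_dim))
deltas = rng.standard_normal((batch, out_dim))

grad = batch_weight_gradient(deltas, inputs)
explicit_sum = sum(np.outer(deltas[i], inputs[i]) for i in range(batch))
grad_rank = np.linalg.matrix_rank(grad)  # bounded by min(batch, in_dim, out_dim)
```

Here the gradient is a 16 by 32 matrix, yet its rank is at most 4 because only four rank-one terms contribute.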
-
Unsupervised Learning of Graph from Recipes
Authors:
Aissatou Diallo,
Antonis Bikakis,
Luke Dickens,
Anthony Hunter,
Rob Miller
Abstract:
Cooking recipes are one of the most readily available kinds of procedural text. They consist of natural language instructions that can be challenging to interpret. In this paper, we propose a model to identify relevant information from recipes and generate a graph to represent the sequence of actions in the recipe. In contrast with other approaches, we use an unsupervised approach. We iteratively learn the graph structure and the parameters of a GNN encoding the texts (text-to-graph) one sequence at a time while providing the supervision by decoding the graph into text (graph-to-text) and comparing the generated text to the input. We evaluate the approach by comparing the identified entities with annotated datasets, comparing the difference between the input and output texts, and comparing our generated graphs with those generated by state-of-the-art methods.
Submitted 22 January, 2024;
originally announced January 2024.
-
PizzaCommonSense: Learning to Model Commonsense Reasoning about Intermediate Steps in Cooking Recipes
Authors:
Aissatou Diallo,
Antonis Bikakis,
Luke Dickens,
Anthony Hunter,
Rob Miller
Abstract:
Decoding the core of procedural texts, exemplified by cooking recipes, is crucial for intelligent reasoning and instruction automation. Procedural texts can be comprehensively defined as a sequential chain of steps to accomplish a task employing resources. From a cooking perspective, these instructions can be interpreted as a series of modifications to a food preparation, which initially comprises a set of ingredients. These changes involve transformations of comestible resources. For a model to effectively reason about cooking recipes, it must accurately discern and understand the inputs and outputs of intermediate steps within the recipe. Aiming to address this, we present a new corpus of cooking recipes enriched with descriptions of intermediate steps of the recipes that explicate the input and output for each step. We discuss the data collection process, investigate and provide baseline models based on T5 and GPT-3.5. This work presents a challenging task and insight into commonsense reasoning and procedural text generation.
Submitted 12 January, 2024;
originally announced January 2024.
-
Artificial Intelligence for Digital and Computational Pathology
Authors:
Andrew H. Song,
Guillaume Jaume,
Drew F. K. Williamson,
Ming Y. Lu,
Anurag Vaidya,
Tiffany R. Miller,
Faisal Mahmood
Abstract:
Advances in digitizing tissue slides and the fast-paced progress in artificial intelligence, including deep learning, have boosted the field of computational pathology. This field holds tremendous potential to automate clinical diagnosis, predict patient prognosis and response to therapy, and discover new morphological biomarkers from tissue images. Some of these artificial intelligence-based systems are now getting approved to assist clinical diagnosis; however, technical barriers remain for their widespread clinical adoption and integration as a research tool. This Review consolidates recent methodological advances in computational pathology for predicting clinical end points in whole-slide images and highlights how these developments enable the automation of clinical practice and the discovery of new biomarkers. We then provide future perspectives as the field expands into a broader range of clinical and research tasks with increasingly diverse modalities of clinical data.
Submitted 12 December, 2023;
originally announced January 2024.
-
Improving age prediction: Utilizing LSTM-based dynamic forecasting for data augmentation in multivariate time series analysis
Authors:
Yutong Gao,
Charles A. Ellis,
Vince D. Calhoun,
Robyn L. Miller
Abstract:
The high dimensionality and complexity of neuroimaging data necessitate large datasets to develop robust and high-performing deep learning models. However, the neuroimaging field is notably hampered by the scarcity of such datasets. In this work, we proposed a data augmentation and validation framework that utilizes dynamic forecasting with Long Short-Term Memory (LSTM) networks to enrich datasets. We extended multivariate time series data by predicting the time courses of independent component networks (ICNs) in both one-step and recursive configurations. The effectiveness of these augmented datasets was then compared with the original data using various deep learning models designed for chronological age prediction tasks. The results suggest that our approach improves model performance, providing a robust solution to overcome the challenges presented by the limited size of neuroimaging datasets.
Submitted 11 December, 2023;
originally announced December 2023.
-
RAMPART: RowHammer Mitigation and Repair for Server Memory Systems
Authors:
Steven C. Woo,
Wendy Elsasser,
Mike Hamburg,
Eric Linstadt,
Michael R. Miller,
Taeksang Song,
James Tringali
Abstract:
RowHammer attacks are a growing security and reliability concern for DRAMs and computer systems as they can induce many bit errors that overwhelm error detection and correction capabilities. System-level solutions are needed as process technology and circuit improvements alone are unlikely to provide complete protection against RowHammer attacks in the future. This paper introduces RAMPART, a novel approach to mitigating RowHammer attacks and improving server memory system reliability by remapping addresses in each DRAM in a way that confines RowHammer bit flips to a single device for any victim row address. When RAMPART is paired with Single Device Data Correction (SDDC) and patrol scrub, error detection and correction methods in use today, the system can detect and correct bit flips from a successful attack, allowing the memory system to heal itself. RAMPART is compatible with DDR5 RowHammer mitigation features, as well as a wide variety of algorithmic and probabilistic tracking methods. We also introduce BRC-VL, a variation of DDR5 Bounded Refresh Configuration (BRC) that improves system performance by reducing mitigation overhead and show that it works well with probabilistic sampling methods to combat traditional and victim-focused mitigation attacks like Half-Double. The combination of RAMPART, SDDC, and scrubbing enables stronger RowHammer resistance by correcting bit flips from one successful attack. Uncorrectable errors are much less likely, requiring two successful attacks before the memory system is scrubbed.
Submitted 25 October, 2023;
originally announced October 2023.
-
Blend: A Unified Data Discovery System
Authors:
Mahdi Esmailoghli,
Christoph Schnell,
Renée J. Miller,
Ziawasch Abedjan
Abstract:
Data discovery is an iterative and incremental process that necessitates the execution of multiple data discovery queries to identify the desired tables from large and diverse data lakes. Current methodologies concentrate on single discovery tasks such as join, correlation, or union discovery. However, in practice, a series of these approaches and their corresponding index structures are necessary to enable the user to discover the desired tables. This paper presents BLEND, a comprehensive data discovery system that empowers users to develop ad-hoc discovery tasks without the need to develop new algorithms or build a new index structure. To achieve this goal, we introduce a general index structure capable of addressing multiple discovery queries. We develop a set of lower-level operators that serve as the fundamental building blocks for more complex and sophisticated user tasks. These operators are highly efficient and enable end-to-end efficiency. To enhance the execution of the discovery pipeline, we rewrite the search queries into optimized SQL statements to push the data operators down to the database. We demonstrate that our holistic system is able to achieve comparable effectiveness and runtime efficiency to the individual state-of-the-art approaches specifically designed for a single task.
Submitted 4 October, 2023;
originally announced October 2023.
-
Generative Benchmark Creation for Table Union Search
Authors:
Koyena Pal,
Aamod Khatiwada,
Roee Shraga,
Renée J. Miller
Abstract:
Data management has traditionally relied on synthetic data generators to generate structured benchmarks, like the TPC suite, where we can control important parameters like data size and its distribution precisely. These benchmarks were central to the success and adoption of database management systems. But more and more, data management problems are of a semantic nature. An important example is finding tables that can be unioned. While any two tables with the same cardinality can be unioned, table union search is the problem of finding tables whose union is semantically coherent. Semantic problems cannot be benchmarked using synthetic data. Our current methods for creating benchmarks involve the manual curation and labeling of real data. These methods are not robust or scalable and, perhaps more importantly, it is not clear how robust the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable but related. We thoroughly evaluate recent existing table union search methods over existing benchmarks and our new benchmark. We also present and evaluate a new table search method based on recent large language models over all benchmarks. We show that the new benchmark is more challenging for all methods than hand-curated benchmarks. Specifically, the top-performing method achieves a Mean Average Precision of around 60%, over 30% less than its performance on existing manually created benchmarks. We examine why this is the case and show that the new benchmark permits more detailed analysis of methods, including a study of both false positives and false negatives that were not possible with existing benchmarks.
Submitted 7 August, 2023;
originally announced August 2023.
-
A Graphical Formalism for Commonsense Reasoning with Recipes
Authors:
Antonis Bikakis,
Aissatou Diallo,
Luke Dickens,
Anthony Hunter,
Rob Miller
Abstract:
Whilst cooking is a very important human activity, there has been little consideration given to how we can formalize recipes for use in a reasoning framework. We address this need by proposing a graphical formalization that captures the comestibles (ingredients, intermediate food items, and final products), and the actions on comestibles in the form of a labelled bipartite graph. We then propose formal definitions for comparing recipes, for composing recipes from subrecipes, and for deconstructing recipes into subrecipes. We also introduce and compare two formal definitions for substitution into recipes which are required when there are missing ingredients, or some actions are not possible, or because there is a need to change the final product somehow.
Submitted 15 June, 2023;
originally announced June 2023.
-
DIALITE: Discover, Align and Integrate Open Data Tables
Authors:
Aamod Khatiwada,
Roee Shraga,
Renée J. Miller
Abstract:
We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art table discovery techniques. Second, DIALITE integrates the discovered tables to produce an integrated table. Finally, it allows users to analyze the integration result by applying different downstream tasks over it. Our pipeline is flexible such that the user can easily add and compare additional discovery and integration algorithms.
Submitted 17 April, 2023;
originally announced April 2023.
-
Evaluating performance and portability of high-level programming models: Julia, Python/Numba, and Kokkos on exascale nodes
Authors:
William F. Godoy,
Pedro Valero-Lara,
T. Elise Dettling,
Christian Trefftz,
Ian Jorquera,
Thomas Sheehy,
Ross G. Miller,
Marc Gonzalez-Tallada,
Jeffrey S. Vetter,
Valentin Churavy
Abstract:
We explore the performance and portability of the high-level programming models: the LLVM-based Julia and Python/Numba, and Kokkos on high-performance computing (HPC) nodes: AMD Epyc CPUs and MI250X graphical processing units (GPUs) on Frontier's test bed Crusher system and Ampere's Arm-based CPUs and NVIDIA's A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facilities. We compare the default performance of a hand-rolled dense matrix multiplication algorithm on CPUs against vendor-compiled C/OpenMP implementations, and on each GPU against CUDA and HIP. Rather than focusing on the kernel optimization per se, we select this naive approach to resemble exploratory work in science and as a lower bound for performance to isolate the effect of each programming model. Julia and Kokkos perform comparably with C/OpenMP on CPUs, while Julia implementations are competitive with CUDA and HIP on GPUs. Performance gaps are identified on NVIDIA A100 GPUs for Julia's single precision and Kokkos, and for Python/Numba in all scenarios. We also comment on half-precision support, productivity, performance portability metrics, and platform readiness. We expect to contribute to the understanding and direction for high-level, high-productivity languages in HPC as the first-generation exascale systems are deployed.
Submitted 10 March, 2023;
originally announced March 2023.
-
A Large-Scale Study of Personal Identifiability of Virtual Reality Motion Over Time
Authors:
Mark Roman Miller,
Eugy Han,
Cyan DeVeaux,
Eliot Jones,
Ryan Chen,
Jeremy N. Bailenson
Abstract:
In recent years, social virtual reality (VR), sometimes described as the "metaverse," has become widely available. With its potential come risks, including risks to privacy. To understand these risks, we study the identifiability of participants' motion in VR in a dataset of 232 VR users with eight weekly sessions of about thirty minutes each, totaling 764 hours of social interaction. The sample is unique as we are able to study the effect of user, session, and time independently. We find that the number of sessions recorded greatly increases identifiability, and duration per session increases identifiability as well, but to a lesser degree. We also find that greater delay between training and testing sessions reduces identifiability. Ultimately, understanding the identifiability of VR activities will help designers, security professionals, and consumer advocates make VR safer.
Submitted 2 March, 2023;
originally announced March 2023.
-
Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)
Authors:
Roee Shraga,
Renée J. Miller
Abstract:
In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
Submitted 30 January, 2023;
originally announced January 2023.
-
Large Scale Radio Frequency Wideband Signal Detection & Recognition
Authors:
Luke Boegner,
Garrett Vanhoy,
Phillip Vallance,
Manbir Gulati,
Dresden Feitzinger,
Bradley Comar,
Robert D. Miller
Abstract:
Applications of deep learning to the radio frequency (RF) domain have largely concentrated on the task of narrowband signal classification after the signals of interest have already been detected and extracted from a wideband capture. To encourage broader research with wideband operations, we introduce the WidebandSig53 (WBSig53) dataset which consists of 550 thousand synthetically-generated samples from 53 different signal classes containing approximately 2 million unique signals. We extend the TorchSig signal processing machine learning toolkit for open-source and customizable generation, augmentation, and processing of the WBSig53 dataset. We conduct experiments using state of the art (SoTA) convolutional neural networks and transformers with the WBSig53 dataset. We investigate the performance of signal detection tasks, i.e. detect the presence, time, and frequency of all signals present in the input data, as well as the performance of signal recognition tasks, where networks detect the presence, time, frequency, and modulation family of all signals present in the input data. Two main approaches to these tasks are evaluated with segmentation networks and object detection networks operating on complex input spectrograms. Finally, we conduct comparative analysis of the various approaches in terms of the networks' mean average precision, mean average recall, and the speed of inference.
Submitted 4 November, 2022;
originally announced November 2022.
-
CommsVAE: Learning the brain's macroscale communication dynamics using coupled sequential VAEs
Authors:
Eloy Geenjaar,
Noah Lewis,
Amrit Kashyap,
Robyn Miller,
Vince Calhoun
Abstract:
Communication within or between complex systems is commonplace in the natural sciences and fields such as graph neural networks. The brain is a perfect example of such a complex system, where communication between brain regions is constantly being orchestrated. To analyze communication, the brain is often split up into anatomical regions that each perform certain computations. These regions must interact and communicate with each other to perform tasks and support higher-level cognition. On a macroscale, these regions communicate through signal propagation along the cortex and along white matter tracts over longer distances. When and what types of signals are communicated over time is an unsolved problem and is often studied using either functional or structural data. In this paper, we propose a non-linear generative approach to communication from functional data. We address three issues with common connectivity approaches by explicitly modeling the directionality of communication, finding communication at each timestep, and encouraging sparsity. To evaluate our model, we simulate temporal data that has sparse communication between nodes embedded in it and show that our model can uncover the expected communication dynamics. Subsequently, we apply our model to temporal neural data from multiple tasks and show that our approach models communication that is more specific to each task. The specificity of our method means it can have an impact on the understanding of psychiatric disorders, which are believed to be related to highly specific communication between brain regions compared to controls. In sum, we propose a general model for dynamic communication learning on graphs, and show its applicability to a subfield of the natural sciences, with potential widespread scientific impact.
Submitted 7 October, 2022;
originally announced October 2022.
-
Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
Authors:
Grace Fan,
** Wang,
Yuliang Li,
Dan Zhang,
Renée Miller
Abstract:
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical evaluation results on real table benchmark datasets show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
Submitted 15 January, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
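The Starmie abstract above scores column unionability by the cosine similarity of column embedding vectors. A minimal sketch of that score follows, assuming the embeddings are plain lists of floats; the vectors `city_a`, `city_b`, and `price` are hypothetical stand-ins and not output of the Starmie encoder, and the zero-vector convention is a choice made here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical column embeddings (not produced by any real encoder).
city_a = [0.9, 0.1, 0.0]
city_b = [0.8, 0.2, 0.1]
price = [0.0, 0.1, 0.9]

# Columns with similar embeddings receive a higher unionability score.
score_similar = cosine_similarity(city_a, city_b)
score_different = cosine_similarity(city_a, price)
```

In this sketch a higher score means the two columns are better candidates for a union; how column scores are combined into a table-level score is left to the framework the abstract describes.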
-
SANTOS: Relationship-based Semantic Table Union Search
Authors:
Aamod Khatiwada,
Grace Fan,
Roee Shraga,
Zixuan Chen,
Wolfgang Gatterbauer,
Renée J. Miller,
Mirek Riedewald
Abstract:
Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns. The first uses an existing knowledge base (KB); the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
Submitted 27 September, 2022;
originally announced September 2022.
-
Application Experiences on a GPU-Accelerated Arm-based HPC Testbed
Authors:
Wael Elwasif,
William Godoy,
Nick Hagerty,
J. Austin Harris,
Oscar Hernandez,
Balint Joo,
Paul Kent,
Damien Lebrun-Grandie,
Elijah Maccarthy,
Veronica G. Melesse Vergara,
Bronson Messer,
Ross Miller,
Sarp Opal,
Sergei Bastrakov,
Michael Bussmann,
Alexander Debus,
Klaus Steinger,
Jan Stephan,
Rene Widera,
Spencer H. Bryngelson,
Henry Le Berre,
Anand Radhakrishnan,
Jefferey Young,
Sunita Chandrasekaran,
Florina Ciorba
, et al. (6 additional authors not shown)
Abstract:
This paper assesses and reports the experience of ten teams working to port, validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems built by GIGABYTE, each one equipped with a server-class Arm CPU from Ampere Computing and A100 data center GPU from NVIDIA Corp. The systems are connected together using Infiniband high-bandwidth low-latency interconnect. The selected applications and mini-apps are written using several programming languages and use multiple accelerator-based programming models for GPUs such as CUDA, OpenACC, and OpenMP offloading. Working on application porting requires a robust and easy-to-access programming environment, including a variety of compilers and optimized scientific libraries. The goal of this work is to evaluate platform readiness and assess the effort required from developers to deploy well-established scientific workloads on current and future generation Arm-based GPU-accelerated HPC systems. The reported case studies demonstrate that the current level of maturity and diversity of software and tools is already adequate for large-scale production deployments.
Submitted 19 December, 2022; v1 submitted 20 September, 2022;
originally announced September 2022.
-
Large Scale Radio Frequency Signal Classification
Authors:
Luke Boegner,
Manbir Gulati,
Garrett Vanhoy,
Phillip Vallance,
Bradley Comar,
Silvija Kokalj-Filipovic,
Craig Lennon,
Robert D. Miller
Abstract:
Existing datasets used to train deep learning models for narrowband radio frequency (RF) signal classification lack enough diversity in signal types and channel impairments to sufficiently assess model performance in the real world. We introduce the Sig53 dataset consisting of 5 million synthetically-generated samples from 53 different signal classes and expertly chosen impairments. We also introduce TorchSig, a signals processing machine learning toolkit that can be used to generate this dataset. TorchSig incorporates data handling principles that are common to the vision domain, and it is meant to serve as an open-source foundation for future signals machine learning research. Initial experiments using the Sig53 dataset are conducted using state of the art (SoTA) convolutional neural networks (ConvNets) and Transformers. These experiments reveal Transformers outperform ConvNets without the need for additional regularization or a ConvNet teacher, which is contrary to results from the vision domain. Additional experiments demonstrate that TorchSig's domain-specific data augmentations facilitate model training, which ultimately benefits model performance. Finally, TorchSig supports on-the-fly synthetic data creation at training time, thus enabling massive scale training sessions with virtually unlimited datasets.
Submitted 20 July, 2022;
originally announced July 2022.
-
Spatio-temporally separable non-linear latent factor learning: an application to somatomotor cortex fMRI data
Authors:
Eloy Geenjaar,
Amrit Kashyap,
Noah Lewis,
Robyn Miller,
Vince Calhoun
Abstract:
Functional magnetic resonance imaging (fMRI) data contain complex spatiotemporal dynamics, so researchers have developed approaches that reduce the dimensionality of the signal while extracting relevant and interpretable dynamics. Models of fMRI data that can perform whole-brain discovery of dynamical latent factors are understudied. The benefits of approaches such as linear independent component analysis models have been widely appreciated; however, nonlinear extensions of these models present challenges in terms of identification. Deep learning methods provide a way forward, but new methods for efficient spatial weight-sharing are critical to deal with the high dimensionality of the data and the presence of noise. Our approach generalizes weight sharing to non-Euclidean neuroimaging data by first performing spectral clustering based on the structural and functional similarity between voxels. The spectral clusters and their assignments can then be used as patches in an adapted multi-layer perceptron (MLP)-mixer model to share parameters among input points. To encourage temporally independent latent factors, we use an additional total correlation term in the loss. Our approach is evaluated on data with multiple motor sub-tasks to assess whether the model captures disentangled latent factors that correspond to each sub-task. Then, to further assess the latent factors we find, we compare the spatial location of each latent factor to the motor homunculus. Finally, we show that our approach captures task effects better than the current gold standard of source signal separation, independent component analysis (ICA).
Submitted 26 May, 2022;
originally announced May 2022.
-
Small Cohort of Epilepsy Patients Showed Increased Activity on Facebook before Sudden Unexpected Death
Authors:
Ian B. Wood,
Rion Brattig Correia,
Wendy R. Miller,
Luis M. Rocha
Abstract:
Sudden Unexpected Death in Epilepsy (SUDEP) remains a leading cause of death in people with epilepsy. Despite the constant risk for patients and bereavement to family members, to date the physiological mechanisms of SUDEP remain unknown. Here we explore the potential to identify putative predictive signals of SUDEP from online digital behavioral data using text and sentiment analysis. Specifically, we analyze Facebook timelines of six epilepsy patients deceased due to SUDEP, donated by surviving family members. We find preliminary evidence for behavioral changes detectable by text and sentiment analysis tools. Namely, in the months preceding their SUDEP event patient social media timelines show: i) increase in verbosity; ii) increased use of functional words; and iii) sentiment shifts as measured by different sentiment analysis tools. Combined, these results suggest that social media engagement, as well as its sentiment, may serve as possible early-warning signals for SUDEP in people with epilepsy. While the small sample of patient timelines analyzed in this study prevents generalization, our preliminary investigation demonstrates the potential of social media data as complementary data in larger studies of SUDEP and epilepsy.
Submitted 19 January, 2022;
originally announced January 2022.
-
An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)
Authors:
Sijia Liu,
Andrew Wen,
Liwei Wang,
Huan He,
Sunyang Fu,
Robert Miller,
Andrew Williams,
Daniel Harris,
Ramakanth Kavuluru,
Mei Liu,
Noor Abu-el-rub,
Dalton Schutte,
Rui Zhang,
Masoud Rouhizadeh,
John D. Osborne,
Yongqun He,
Umit Topaloglu,
Stephanie S Hong,
Joel H Saltz,
Thomas Schaffter,
Emily Pfaff,
Christopher G. Chute,
Tim Duong,
Melissa A. Haendel,
Rafael Fuentes
, et al. (7 additional authors not shown)
Abstract:
While we pay attention to the latest advances in clinical natural language processing (NLP), we can notice some resistance in the clinical and translational research community to adopt NLP models due to limited transparency, interpretability, and usability. In this study, we proposed an open natural language processing development framework. We evaluated it through the implementation of NLP algorithms for the National COVID Cohort Collaborative (N3C). Based on the interests in information extraction from COVID-19 related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to generate texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three different institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested with a single institution's (Mayo) ruleset. This resulted in performances of 0.876, 0.706, and 0.694 in F-scores for Mayo, Minnesota, and Kentucky test datasets, respectively. The study as a consortium effort of the N3C NLP subgroup demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP study and adoption. Although we use COVID-19 as a use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.
Submitted 21 March, 2022; v1 submitted 20 October, 2021;
originally announced October 2021.
-
Repurposing of Resources: from Everyday Problem Solving through to Crisis Management
Authors:
Antonis Bikakis,
Luke Dickens,
Anthony Hunter,
Rob Miller
Abstract:
The human ability to repurpose objects and processes is universal, but it is not a well-understood aspect of human intelligence. Repurposing arises in everyday situations such as finding substitutes for missing ingredients when cooking, or for unavailable tools when doing DIY. It also arises in critical, unprecedented situations needing crisis management. After natural disasters and during wartime, people must repurpose the materials and processes available to make shelter, distribute food, etc. Repurposing is equally important in professional life (e.g. clinicians often repurpose medicines off-license) and in addressing societal challenges (e.g. finding new roles for waste products). Despite the importance of repurposing, the topic has received little academic attention. By considering examples from a variety of domains such as everyday activities, drug repurposing and natural disasters, we identify some principal characteristics of the process and describe some technical challenges that would be involved in modelling and simulating it. We consider cases of both substitution, i.e. finding an alternative for a missing resource, and exploitation, i.e. identifying a new role for an existing resource. We argue that these ideas could be developed into a general formal theory of repurposing, and that this could then lead to the development of AI methods based on commonsense reasoning, argumentation, ontological reasoning, and various machine learning methods, to develop tools to support repurposing in practice.
Submitted 17 September, 2021;
originally announced September 2021.
-
Reservoir Based Edge Training on RF Data To Deliver Intelligent and Efficient IoT Spectrum Sensors
Authors:
Silvija Kokalj-Filipovic,
Paul Toliver,
William Johnson,
Rob Miller
Abstract:
Current radio frequency (RF) sensors at the Edge lack the computational resources to support practical, in-situ training for intelligent spectrum monitoring, and sensor data classification in general. We propose a solution via Deep Delay Loop Reservoir Computing (DLR), a processing architecture that supports general machine learning algorithms on compact mobile devices by leveraging delay-loop reservoir computing in combination with innovative electrooptical hardware. With both digital and photonic realizations of our design of the loops, DLR delivers reductions in form factor, hardware complexity and latency, compared to the State-of-the-Art (SoA). The main impact of the reservoir is to project the input data into a higher dimensional space of reservoir state vectors in order to linearly separate the input classes. Once the classes are well separated, traditionally complex, power-hungry classification models are no longer needed for the learning process. Yet, even with simple classifiers based on Ridge regression (RR), the complexity grows at least quadratically with the input size. Hence, the hardware reduction required for training on compact devices is in contradiction with the large dimension of state vectors. DLR employs a RR-based classifier to exceed the SoA accuracy, while further reducing power consumption by leveraging the architecture of parallel (split) loops. We present DLR architectures composed of multiple smaller loops whose state vectors are linearly combined to create a lower dimensional input into Ridge regression. We demonstrate the advantages of using DLR for two distinct applications: RF Specific Emitter Identification (SEI) for IoT authentication, and wireless protocol recognition for IoT situational awareness.
Submitted 1 April, 2021;
originally announced June 2021.
-
Algorithm-Agnostic Explainability for Unsupervised Clustering
Authors:
Charles A. Ellis,
Mohammad S. E. Sendi,
Eloy P. T. Geenjaar,
Sergey M. Plis,
Robyn L. Miller,
Vince D. Calhoun
Abstract:
Supervised machine learning explainability has developed rapidly in recent years. However, clustering explainability has lagged behind. Here, we demonstrate the first adaptation of model-agnostic explainability methods to explain unsupervised clustering. We present two novel "algorithm-agnostic" explainability methods - global permutation percent change (G2PC) and local perturbation percent change (L2PC) - that identify feature importance globally to a clustering algorithm and locally to the clustering of individual samples. The methods are (1) easy to implement and (2) broadly applicable across clustering algorithms, which could make them highly impactful. We demonstrate the utility of the methods for explaining five popular clustering methods on low-dimensional synthetic datasets and on high-dimensional functional network connectivity data extracted from a resting-state functional magnetic resonance imaging dataset of 151 individuals with schizophrenia and 160 controls. Our results are consistent with existing literature while also shedding new light on how changes in brain connectivity may lead to schizophrenia symptoms. We further compare the explanations from our methods to an interpretable classifier and find them to be highly similar. Our proposed methods robustly explain multiple clustering algorithms and could facilitate new insights into many applications. We hope this study will greatly accelerate the development of the field of clustering explainability.
Submitted 28 August, 2021; v1 submitted 17 May, 2021;
originally announced May 2021.
-
Reservoir-Based Distributed Machine Learning for Edge Operation
Authors:
Silvija Kokalj-Filipovic,
Paul Toliver,
William Johnson,
Rob Miller
Abstract:
We introduce a novel design for in-situ training of machine learning algorithms built into smart sensors, and illustrate distributed training scenarios using radio frequency (RF) spectrum sensors. Current RF sensors at the Edge lack the computational resources to support practical, in-situ training for intelligent signal classification. We propose a solution using Deep Delay Loop Reservoir Computing (DLR), a processing architecture that supports machine learning algorithms on resource-constrained edge-devices by leveraging delay-loop reservoir computing in combination with innovative hardware. DLR delivers reductions in form factor, hardware complexity and latency, compared to the State-of-the-Art (SoA) neural nets. We demonstrate DLR for two applications: RF Specific Emitter Identification (SEI) and wireless protocol recognition. DLR enables mobile edge platforms to authenticate and then track emitters with fast SEI retraining. Once delay loops separate the data classes, traditionally complex, power-hungry classification models are no longer needed for the learning process. Yet, even with simple classifiers such as Ridge Regression (RR), the complexity grows at least quadratically with the input size. DLR with a RR classifier exceeds the SoA accuracy, while further reducing power consumption by leveraging the architecture of parallel (split) loops. To authenticate mobile devices across large regions, DLR can be trained in a distributed fashion with very little additional processing and a small communication cost, all while maintaining accuracy. We illustrate how to merge locally trained DLR classifiers in use cases of interest.
Submitted 1 April, 2021;
originally announced April 2021.
-
DomainNet: Homograph Detection for Data Lake Disambiguation
Authors:
Aristotelis Leventidis,
Laura Di Rocco,
Wolfgang Gatterbauer,
Renée J. Miller,
Mirek Riedewald
Abstract:
Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.
Submitted 22 March, 2021; v1 submitted 17 March, 2021;
originally announced March 2021.
-
Estimation of Cardiac Valve Annuli Motion with Deep Learning
Authors:
Eric Kerfoot,
Carlos Escudero King,
Tefvik Ismail,
David Nordsletten,
Renee Miller
Abstract:
Valve annuli motion and morphology, measured from non-invasive imaging, can be used to gain a better understanding of healthy and pathological heart function. Measurements such as long-axis strain as well as peak strain rates provide markers of systolic function. Likewise, early and late-diastolic filling velocities are used as indicators of diastolic function. Quantifying global strains, however, requires a fast and precise method of tracking long-axis motion throughout the cardiac cycle. Valve landmarks such as the insertion of leaflets into the myocardial wall provide features that can be tracked to measure global long-axis motion. Feature tracking methods require initialisation, which can be time-consuming in studies with large cohorts. Therefore, this study developed and trained a neural network to identify ten features from unlabeled long-axis MR images: six mitral valve points from three long-axis views, two aortic valve points and two tricuspid valve points. This study used manual annotations of valve landmarks in standard 2-, 3- and 4-chamber long-axis images collected in clinical scans to train the network. The accuracy in the identification of these ten features, in pixel distance, was compared with the accuracy of two commonly used feature tracking methods as well as the inter-observer variability of manual annotations. Clinical measures, such as valve landmark strain and motion between end-diastole and end-systole, are also presented to illustrate the utility and robustness of the method.
Submitted 23 October, 2020;
originally announced October 2020.
-
On the Effects of Knowledge-Augmented Data in Word Embeddings
Authors:
Diego Ramirez-Echavarria,
Antonis Bikakis,
Luke Dickens,
Rob Miller,
Andreas Vlachidis
Abstract:
This paper investigates techniques for knowledge injection into word embeddings learned from large corpora of unannotated data. These representations are trained with word cooccurrence statistics and do not commonly exploit syntactic and semantic information from linguistic knowledge bases, which potentially limits their transferability to domains with differing language distributions or usages. We propose a novel approach for linguistic knowledge injection through data augmentation to learn word embeddings that enforce semantic relationships from the data, and systematically evaluate the impact it has on the resulting representations. We show our knowledge augmentation approach improves the intrinsic characteristics of the learned embeddings while not significantly altering their results on a downstream text classification task.
Submitted 4 October, 2020;
originally announced October 2020.
-
GeoTree: a data structure for constant time geospatial search enabling a real-time mix-adjusted median property price index
Authors:
Robert Miller,
Phil Maguire
Abstract:
A common problem appearing across the field of data science is $k$-NN ($k$-nearest neighbours), particularly within the context of Geographic Information Systems. In this article, we present a novel data structure, the GeoTree, which holds a collection of geohashes (string encodings of GPS co-ordinates). This enables a constant $O\left(1\right)$ time search algorithm that returns a set of geohashes surrounding a given geohash in the GeoTree, representing the approximate $k$-nearest neighbours of that geohash. Furthermore, the GeoTree data structure retains $O\left(n\right)$ memory requirement. We apply the data structure to a property price index algorithm focused on price comparison with historical neighbouring sales, demonstrating an enhanced performance. The results show that this data structure allows for the development of a real-time property price index, and can be scaled to larger datasets with ease.
Submitted 5 August, 2020;
originally announced August 2020.
-
Knowledge Translation: Extended Technical Report
Authors:
Bahar Ghadiri Bashardoost,
Renée J. Miller,
Kelly Lyons,
Fatemeh Nargesian
Abstract:
We introduce Kensho, a tool for generating mapping rules between two Knowledge Bases (KBs). To create the mapping rules, Kensho starts with a set of correspondences and enriches them with additional semantic information automatically identified from the structure and constraints of the KBs. Our approach works in two phases. In the first phase, semantic associations between resources of each KB are captured. In the second phase, mapping rules are generated by interpreting the correspondences in a way that respects the discovered semantic associations among elements of each KB. Kensho's mapping rules are expressed using SPARQL queries and can be used directly to exchange knowledge from source to target. Kensho is able to automatically rank the generated mapping rules using a set of heuristics. We present an experimental evaluation of Kensho and assess our mapping generation and ranking strategies using more than 50 synthesized and real world settings, chosen to showcase some of the most important applications of knowledge translation. In addition, we use three existing benchmarks to demonstrate Kensho's ability to deal with different mapping scenarios.
Submitted 3 August, 2020;
originally announced August 2020.
-
A blockchain-orchestrated Federated Learning architecture for healthcare consortia
Authors:
Jonathan Passerat-Palmbach,
Tyler Farnan,
Robert Miller,
Marielle S. Gross,
Heather Leigh Flannery,
Bill Gleim
Abstract:
We propose a novel architecture for federated learning within healthcare consortia. At the heart of the solution is a unique integration of privacy preserving technologies, built upon native enterprise blockchain components available in the Ethereum ecosystem. We show how the specific characteristics and challenges of healthcare consortia informed our design choices, notably the conception of a new Secure Aggregation protocol assembled with a protected hardware component and an encryption toolkit native to Ethereum. Our architecture also brings in a privacy preserving audit trail that logs events in the network without revealing identities.
Submitted 12 October, 2019;
originally announced October 2019.
-
Scientific Statement Classification over arXiv.org
Authors:
Deyan Ginev,
Bruce R. Miller
Abstract:
We introduce a new classification task for scientific statements and release a large-scale dataset for supervised learning. Our resource is derived from a machine-readable representation of the ar** 10.5 million annotated paragraphs into thirteen classes. We demonstrate that the task setup aligns with known success rates from the state of the art, peaking at a 0.91 F1-score via a BiLSTM encoder-decoder model. Additionally, we introduce a lexeme serialization for mathematical formulas, and observe that context-aware models could improve when also trained on the symbolic modality. Finally, we discuss the limitations of both data and task design, and outline potential directions towards increasingly complex models of scientific discourse, beyond isolated statements.
Submitted 28 August, 2019;
originally announced August 2019.
-
Hardware-In-the-Loop for Connected Automated Vehicles Testing in Real Traffic
Authors:
Yeojun Kim,
Samuel Tay,
Jacopo Guanetti,
Francesco Borrelli,
Ryan Miller
Abstract:
We present a hardware-in-the-loop (HIL) simulation setup for repeatable testing of Connected Automated Vehicles (CAVs) in dynamic, real-world scenarios. Our goal is to test control and planning algorithms and their distributed implementation on the vehicle hardware and, possibly, in the cloud. The HIL setup combines PreScan for perception sensors, road topography, and signalized intersections; Vissim for traffic micro-simulation; ETAS DESK-LABCAR/a dynamometer for vehicle and powertrain dynamics; and on-board electronic control units for CAV real time control. Models of traffic and signalized intersections are driven by real-world measurements. To demonstrate this HIL simulation setup, we test a Model Predictive Control approach for maximizing energy efficiency of CAVs in urban environments.
Submitted 21 July, 2019;
originally announced July 2019.
-
AutoEncoders for Training Compact Deep Learning RF Classifiers for Wireless Protocols
Authors:
Silvija Kokalj-Filipovic,
Rob Miller,
Joshua Morman
Abstract:
We show that compact fully connected (FC) deep learning networks trained to classify wireless protocols using a hierarchy of multiple denoising autoencoders (AEs) outperform reference FC networks trained in a typical way, i.e., with a stochastic gradient based optimization of a given FC architecture. Not only is the complexity of such an FC network, measured in number of trainable parameters and scalar multiplications, much lower than that of the reference FC and residual models, its accuracy also exceeds that of both models for nearly all tested SNR values (0 dB to 50 dB). Such AE-trained networks are suited for in-situ protocol inference performed by simple mobile devices based on noisy signal measurements. Training is based on data transmitted by real devices, collected in a controlled environment, and systematically augmented by a policy-based data synthesis process that adds to the signal any subset of impairments commonly seen in a wireless receiver.
Submitted 12 April, 2019;
originally announced April 2019.
-
Explaining Anomalies Detected by Autoencoders Using SHAP
Authors:
Liat Antwarg,
Ronnie Mindlin Miller,
Bracha Shapira,
Lior Rokach
Abstract:
Anomaly detection algorithms are often thought to be limited because they don't facilitate the process of validating results performed by domain experts. In contrast, deep learning algorithms for anomaly detection, such as autoencoders, point out the outliers, saving experts the time-consuming task of examining normal cases in order to find anomalies. Most outlier detection algorithms output a score for each instance in the database. The top-k most intense outliers are returned to the user for further inspection; however, the manual validation of results becomes challenging without additional clues. An explanation of why an instance is anomalous enables the experts to focus their investigation on the most important anomalies and may increase their trust in the algorithm.
Recently, a game theory-based framework known as SHapley Additive exPlanations (SHAP) has been shown to be effective in explaining various supervised learning models. In this research, we extend SHAP to explain anomalies detected by an autoencoder, an unsupervised model. The proposed method extracts and visually depicts both the features that most contributed to the anomaly and those that offset it. A preliminary experimental study using real world data demonstrates the usefulness of the proposed method in assisting the domain experts to understand the anomaly and filtering out the uninteresting anomalies, aiming at minimizing the false positive rate of detected anomalies.
Submitted 30 June, 2020; v1 submitted 6 March, 2019;
originally announced March 2019.
-
Mitigation of Adversarial Examples in RF Deep Classifiers Utilizing AutoEncoder Pre-training
Authors:
Silvija Kokalj-Filipovic,
Rob Miller,
Nicholas Chang,
Chi Leung Lau
Abstract:
Adversarial examples in machine learning for images are widely publicized and explored. Illustrations of misclassifications caused by slightly perturbed inputs are abundant and commonly known (e.g., a picture of a panda imperceptibly perturbed to fool the classifier into incorrectly labeling it as a gibbon). Similar attacks on deep learning (DL) for radio frequency (RF) signals and their mitigation strategies are scarcely addressed in the published work. Yet, RF adversarial examples (AdExs) with minimal waveform perturbations can cause drastic, targeted misclassification results, particularly against spectrum sensing/survey applications (e.g., BPSK is mistaken for 8-PSK). Our research on deep learning AdExs and proposed defense mechanisms is RF-centric and incorporates physical world, over-the-air (OTA) effects. We herein present defense mechanisms based on pre-training the target classifier using an autoencoder. Our results validate this approach as a viable mitigation method to subvert adversarial attacks against deep learning-based communications and radar sensing systems.
Submitted 16 February, 2019;
originally announced February 2019.
-
Adversarial Examples in RF Deep Learning: Detection of the Attack and its Physical Robustness
Authors:
Silvija Kokalj-Filipovic,
Rob Miller
Abstract:
While research on adversarial examples in machine learning for images has been prolific, similar attacks on deep learning (DL) for radio frequency (RF) signals and their mitigation strategies are scarcely addressed in the published work, with only one recent publication in the RF domain [1]. RF adversarial examples (AdExs) can cause drastic, targeted misclassification results mostly in spectrum sensing/survey applications (e.g., BPSK mistaken for 8-PSK) with minimal waveform perturbation. It is not clear if the RF AdExs maintain their effects in the physical world, i.e., when AdExs are delivered over-the-air (OTA). Our research on deep learning AdExs and proposed defense mechanisms is RF-centric and incorporates physical world, OTA effects. We here present defense mechanisms based on statistical tests. One test to detect AdExs utilizes the Peak-to-Average-Power-Ratio (PAPR) of the DL data points delivered OTA, while another statistical test uses the Softmax outputs of the DL classifier, which correspond to the probabilities the classifier assigns to each of the trained classes. The former test leverages the RF nature of the data, and the latter is universally applicable to AdExs regardless of their origin. Both solutions are shown as viable mitigation methods to subvert adversarial attacks against communications and radar sensing systems.
Submitted 16 February, 2019;
originally announced February 2019.
-
Data Lake Organization
Authors:
Fatemeh Nargesian,
Ken Q. Pu,
Bahar Ghadiri Bashardoost,
Erkang Zhu,
Renée J. Miller
Abstract:
We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user finding a table using the organization. We propose the data lake organization problem as the problem of finding an organization that maximizes the expected probability of discovering tables by navigating an organization. We propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users discover relevant tables that cannot be found by keyword search. In addition, in our study, 42% of users preferred the use of navigation and 58% preferred keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes. Our experiments show that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.
Submitted 2 March, 2020; v1 submitted 17 December, 2018;
originally announced December 2018.
-
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge
Authors:
Spyridon Bakas,
Mauricio Reyes,
Andras Jakab,
Stefan Bauer,
Markus Rempfler,
Alessandro Crimi,
Russell Takeshi Shinohara,
Christoph Berger,
Sung Min Ha,
Martin Rozycki,
Marcel Prastawa,
Esther Alberts,
Jana Lipkova,
John Freymann,
Justin Kirby,
Michel Bilello,
Hassan Fathallah-Shaykh,
Roland Wiest,
Jan Kirschke,
Benedikt Wiestler,
Rivka Colen,
Aikaterini Kotrotsou,
Pamela Lamontagne,
Daniel Marcus,
Mikhail Milchenko
, et al. (402 additional authors not shown)
Abstract:
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.
Submitted 23 April, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Reduced-Order Modeling through Machine Learning Approaches for Brittle Fracture Applications
Authors:
A. Hunter,
B. A. Moore,
M. K. Mudunuru,
V. T. Chau,
R. L. Miller,
R. B. Tchoua,
C. Nyshadham,
S. Karra,
D. O. Malley,
E. Rougier,
H. S. Viswanathan,
G. Srinivasan
Abstract:
In this paper, five different approaches for reduced-order modeling of brittle fracture in geomaterials, specifically concrete, are presented and compared. Four of the five methods rely on machine learning (ML) algorithms to approximate important aspects of the brittle fracture problem. In addition to the ML algorithms, each method incorporates different physics-based assumptions in order to reduce the computational complexity while maintaining the physics as much as possible. This work specifically focuses on using the ML approaches to model a 2D concrete sample under low strain rate pure tensile loading conditions with 20 preexisting cracks present. A high-fidelity finite element-discrete element model is used to produce both a training dataset of 150 simulations and an additional 35 simulations for validation. Results from the ML approaches are directly compared against the results from the high-fidelity model. Strengths and weaknesses of each approach are discussed, and the most important conclusion is that a combination of physics-informed and data-driven features is necessary for emulating the physics of crack propagation, interaction and coalescence. All of the models presented here have runtimes that are orders of magnitude faster than the original high-fidelity model and pave the path for developing accurate reduced order models that could be used to inform larger length-scale models with important sub-scale physics that often cannot be accounted for due to computational cost.
Submitted 5 June, 2018;
originally announced June 2018.
-
Rotation Blurring: Use of Artificial Blurring to Reduce Cybersickness in Virtual Reality First Person Shooters
Authors:
Pulkit Budhiraja,
Mark Roman Miller,
Abhishek K Modi,
David Forsyth
Abstract:
Users of Virtual Reality (VR) systems often experience vection, the perception of self-motion in the absence of any physical movement. While vection helps to improve presence in VR, it often leads to a form of motion sickness called cybersickness. Cybersickness is a major deterrent to large scale adoption of VR.
Prior work has discovered that changing vection (changing the perceived speed or moving direction) causes more severe cybersickness than steady vection (walking at a constant speed or in a constant direction). Based on this idea, we try to reduce the cybersickness caused by character movements in a First Person Shooter (FPS) game in VR. We propose Rotation Blurring (RB), uniformly blurring the screen during rotational movements to reduce cybersickness. We performed a user study to evaluate the impact of RB in reducing cybersickness. We found that the blurring technique led to an overall reduction in sickness levels of the participants and delayed its onset. Participants who experienced acute levels of cybersickness benefited significantly from this technique.
Submitted 6 October, 2017;
originally announced October 2017.
-
Foundations for a Probabilistic Event Calculus
Authors:
Fabio Aurelio D'Asaro,
Antonis Bikakis,
Luke Dickens,
Rob Miller
Abstract:
We present PEC, an Event Calculus (EC) style action language for reasoning about probabilistic causal and narrative information. It has an action language style syntax similar to that of the EC variant Modular-E. Its semantics is given in terms of possible worlds which constitute possible evolutions of the domain, and builds on that of EFEC, an epistemic extension of EC. We also describe an ASP implementation of PEC and show the sense in which this is sound and complete.
Submitted 30 June, 2017; v1 submitted 20 March, 2017;
originally announced March 2017.
-
A Collective, Probabilistic Approach to Schema Mapping: Appendix
Authors:
Angelika Kimmig,
Alex Memory,
Renee J. Miller,
Lise Getoor
Abstract:
In this appendix we provide additional supplementary material to "A Collective, Probabilistic Approach to Schema Mapping." We include an additional extended example, supplementary experiment details, and proof for the complexity result stated in the main paper.
Submitted 11 February, 2017;
originally announced February 2017.
-
LSH Ensemble: Internet-Scale Domain Search
Authors:
Erkang Zhu,
Fatemeh Nargesian,
Ken Q. Pu,
Renée J. Miller
Abstract:
We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as $|Q \cap X|/|Q|$, as the relevance measure of a domain $X$ to a query domain $Q$. Our choice of Jaccard set containment over Jaccard similarity makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to have poor performance over sets with large differences in their domain sizes. We demonstrate that the domains found in several real-life Open Data and web data repositories show a power-law distribution over their domain sizes.
We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale. Our index structure and search algorithm cope with the data volume and skew by means of data sketches (MinHash) and domain partitioning. Our index structure does not assume a prescribed set of values. We construct a cost model that describes the accuracy of LSH Ensemble with any given partitioning. This allows us to formulate the partitioning for LSH Ensemble as an optimization problem. We prove that there exists an optimal partitioning for any distribution. Furthermore, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, we show that the optimal partitioning can be approximated using equi-depth, making it efficient to use in practice.
We evaluate our algorithm using real data (Canadian Open Data and WDC Web Tables) containing over 262 M domains. The experiments demonstrate that our index consistently outperforms other leading alternatives in accuracy and performance. The improvements are most dramatic for data with large skew in the domain sizes. Even at 262 M domains, our index sustains query performance with under 3 seconds response time.
Submitted 23 July, 2016; v1 submitted 23 March, 2016;
originally announced March 2016.
-
Strategies for Parallel Markup
Authors:
Bruce R. Miller
Abstract:
Cross-referenced parallel markup for mathematics allows the combination of both presentation and content representations while associating the components of each. Interesting applications are enabled by such an arrangement, such as interaction with parts of the presentation to manipulate and query the corresponding content, and enhanced search indexing. Although the idea of such markup is hardly new, creating and manipulating it effectively is more difficult than it appears. Since the structures and tokens in the two formats often do not correspond one-to-one, decisions and heuristics must be developed to determine in which way each component refers to and is referred to by components of the other representation. Conversion between fine and coarse grained parallel markup complicates ID assignments. In this paper, we will describe the techniques developed for \LaTeXML, a \TeX/\LaTeX to XML converter, to create cross-referenced parallel MathML. While we do not yet consider \LaTeXML's content MathML to be useful, the current effort is a step towards that continuing goal.
Submitted 2 July, 2015;
originally announced July 2015.