-
An Annotated Glossary for Data Commons, Data Meshes, and Other Data Platforms
Authors:
Robert L. Grossman
Abstract:
Cloud-based data commons, data meshes, data hubs, and other data platforms are important ways to manage, analyze and share data to accelerate research and to support reproducible research. This is an annotated glossary of some of the more common terms used in articles and discussions about these platforms.
Submitted 23 April, 2024;
originally announced April 2024.
-
Enhancing Instance-Level Image Classification with Set-Level Labels
Authors:
Renyu Zhang,
Aly A. Khan,
Yuxin Chen,
Robert L. Grossman
Abstract:
Instance-level image classification tasks have traditionally relied on single-instance labels to train models, e.g., few-shot learning and transfer learning. However, set-level coarse-grained labels that capture relationships among instances can provide richer information in real-world scenarios. In this paper, we present a novel approach to enhance instance-level image classification by leveraging set-level labels. We provide a theoretical analysis of the proposed method, including recognition conditions for fast excess risk rate, shedding light on the theoretical foundations of our approach. We conducted experiments on two distinct categories of datasets: natural image datasets and histopathology image datasets. Our experimental results demonstrate the effectiveness of our approach, showcasing improved classification performance compared to traditional single-instance label-based methods. Notably, our algorithm achieves 13% improvement in classification accuracy compared to the strongest baseline on the histopathology image classification benchmarks. Importantly, our experimental findings align with the theoretical analysis, reinforcing the robustness and reliability of our proposed method. This work bridges the gap between instance-level and set-level image classification, offering a promising avenue for advancing the capabilities of image classification models with set-level coarse-grained labels.
Submitted 17 November, 2023; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Knot Mosaics with Corner Connection Tiles
Authors:
Aaron Heap,
Una Donovan,
Riley Grossman,
Nickolas Laine,
Connor McDermott,
Marcus Paone,
Drew Southcott
Abstract:
A knot mosaic is a representation of a knot or link on a square grid using a collection of tiles that are either blank or contain a portion of the knot diagram. Traditionally, a piece of the knot on one tile connects to a piece of the knot on an adjacent tile at a connection point that is located at the midpoint of a tile edge. We introduce a new set of tiles in which the connection points are located at corners of the tile. By doing this, we can create more efficient knot mosaics for knots with small crossing number. In particular, when using these corner connection tiles, it is possible to create knot mosaic diagrams for all knots with crossing number 8 or less on a mosaic that is no larger and uses fewer non-blank tiles than is possible using the traditional tiles.
Submitted 2 April, 2024; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Principles and Guidelines for Sharing Biomedical Data for Secondary Use: The University of Chicago Perspective
Authors:
Robert L. Grossman,
Maryellen L. Giger,
Julie A. Johnson,
Jeremy D. Marks,
Jessica P. Ridgway,
Julian Solway,
Walter M. Stadler
Abstract:
Academic medical centers are generating an increasing amount of biomedical data and there is an increasing demand for biomedical data for research purposes by research projects, research consortia, companies, and other third parties. At the same time, as the number of patients grows and the amount of data per patient grows, there is an increasing possibility that some information about some patients may become available if the data is shared with third parties and the third parties have a data breach or violate the terms of the data use agreement. Balancing the importance of research that may result in improved patient outcomes with the importance of protecting patient data is challenging. The article discusses the principles, considerations about risks and mitigating risks, and guidelines used at the University of Chicago for making decisions about sharing biomedical data with third parties.
Submitted 5 February, 2023;
originally announced February 2023.
-
Deep Learning Generates Synthetic Cancer Histology for Explainability and Education
Authors:
James M. Dolezal,
Rachelle Wolk,
Hanna M. Hieromnimon,
Frederick M. Howard,
Andrew Srisuwananukorn,
Dmitry Karpeyev,
Siddhi Ramesh,
Sara Kochanny,
Jung Woo Kwon,
Meghana Agni,
Richard C. Simon,
Chandni Desai,
Raghad Kherallah,
Tung D. Nguyen,
Jefree J. Schulte,
Kimberly Cole,
Galina Khramtsova,
Marina Chiara Garassino,
Aliya N. Husain,
Huihua Li,
Robert Grossman,
Nicole A. Cipriani,
Alexander T. Pearson
Abstract:
Artificial intelligence methods including deep neural networks (DNN) can provide rapid molecular classification of tumors from routine histology with accuracy that matches or exceeds human pathologists. Discerning how neural networks make their predictions remains a significant challenge, but explainability tools help provide insights into what models have learned when corresponding histologic features are poorly defined. Here, we present a method for improving explainability of DNN models using synthetic histology generated by a conditional generative adversarial network (cGAN). We show that cGANs generate high-quality synthetic histology images that can be leveraged for explaining DNN models trained to classify molecularly-subtyped tumors, exposing histologic features associated with molecular state. Fine-tuning synthetic histology through class and layer blending illustrates nuanced morphologic differences between tumor subtypes. Finally, we demonstrate the use of synthetic histology for augmenting pathologist-in-training education, showing that these intuitive visualizations can reinforce and improve understanding of histologic manifestations of tumor biology.
Submitted 9 December, 2022; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Ten Lessons for Data Sharing With a Data Commons
Authors:
Robert L. Grossman
Abstract:
A data commons is a cloud-based data platform with a governance structure that allows a community to manage, analyze and share its data. Data commons provide a research community with the ability to manage and analyze large datasets using the elastic scalability provided by cloud computing and to share data securely and compliantly, and, in this way, accelerate the pace of research. Over the past decade, a number of data commons have been developed and we discuss some of the lessons learned from this effort.
Submitted 22 July, 2022;
originally announced July 2022.
-
A Framework for the Interoperability of Cloud Platforms: Towards FAIR Data in SAFE Environments
Authors:
Robert L. Grossman,
Rebecca R. Boyles,
Brandi N. Davis-Dusenbery,
Amanda Haddock,
Allison P. Heath,
Brian D. O'Connor,
Adam C. Resnick,
Deanne M. Taylor,
Stan Ahalt
Abstract:
As the number of cloud platforms supporting scientific research grows, there is an increasing need to support interoperability between two or more cloud platforms, as a growing amount of data is being hosted in cloud-based platforms. A well accepted core concept is to make data in cloud platforms Findable, Accessible, Interoperable and Reusable (FAIR). We introduce a companion concept that applies to cloud-based computing environments that we call a Secure and Authorized FAIR Environment (SAFE). SAFE environments require data and platform governance structures and are designed to support the interoperability of sensitive or controlled access data, such as biomedical data. A SAFE environment is a cloud platform that has been approved through a defined data and platform governance process as authorized to hold data from another cloud platform and exposes appropriate APIs for the two platforms to interoperate.
Submitted 15 February, 2024; v1 submitted 9 March, 2022;
originally announced March 2022.
-
The Absurdity of Death Estimates Based on the Vaccine Adverse Event Reporting System
Authors:
Gordon V Cormack,
Maura R Grossman
Abstract:
We demonstrate from first principles a core fallacy employed by a coterie of authors who claim that data from the Vaccine Adverse Event Reporting System (VAERS) show that hundreds of thousands of U.S. deaths are attributable to COVID vaccination.
Submitted 8 February, 2022;
originally announced February 2022.
-
Scalable Batch-Mode Deep Bayesian Active Learning via Equivalence Class Annealing
Authors:
Renyu Zhang,
Aly A. Khan,
Robert L. Grossman,
Yuxin Chen
Abstract:
Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active models, rely heavily on the quality of uncertainty estimations of the model, and are often challenging to scale to large batches. In this paper, we propose Batch-BALanCe, a scalable batch-mode active learning algorithm, which combines insights from decision-theoretic active learning, combinatorial information measure, and diversity sampling. At its core, Batch-BALanCe relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALanCe adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks, and achieves compelling performance on several benchmark datasets for active learning under both low- and large-batch regimes. Reference code is released at https://github.com/zhangrenyuuchicago/BALanCe.
Submitted 20 February, 2023; v1 submitted 27 December, 2021;
originally announced December 2021.
-
The eDiscovery Medicine Show
Authors:
Maura R. Grossman,
Gordon V. Cormack
Abstract:
The practice of bloodletting gradually fell into disfavor as a growing body of scientific evidence showed its ineffectiveness and demonstrated the effectiveness of various pharmaceuticals for the prevention and treatment of certain diseases. At the same time, the patent medicine industry promoted ineffective remedies at medicine shows featuring entertainment, testimonials, and pseudo-scientific claims with all the trappings--but none of the methodology--of science. Today, many producing parties and eDiscovery vendors similarly promote obsolete technology as well as unvetted tools labeled "artificial intelligence" or "technology-assisted review," along with unsound validation protocols. This situation will end only when eDiscovery technologies and tools are subject to testing using the methods of information retrieval.
Submitted 28 September, 2021;
originally announced September 2021.
-
Participation in TREC 2020 COVID Track Using Continuous Active Learning
Authors:
Xue Jun Wang,
Maura R. Grossman,
Seung Gyu Hyun
Abstract:
We describe our participation in all five rounds of the TREC 2020 COVID Track (TREC-COVID). The goal of TREC-COVID is to contribute to the response to the COVID-19 pandemic by identifying answers to many pressing questions and building infrastructure to improve search systems [8]. All five rounds of this Track challenged participants to perform a classic ad-hoc search task on the new data collection CORD-19. Our solution addressed this challenge by applying the Continuous Active Learning model (CAL) and its variations. Our results showed us to be amongst the top scoring manual runs and we remained competitive within all categories of submissions.
Submitted 2 November, 2020;
originally announced November 2020.
-
The realization of input-output maps using bialgebras
Authors:
Robert L. Grossman,
Richard G. Larson
Abstract:
We use the theory of bialgebras to provide the algebraic background for state space realization theorems for input-output maps of control systems. This allows us to consider from a common viewpoint classical results about formal state space realizations of nonlinear systems and more recent results involving analysis related to families of trees. If $H$ is a bialgebra, we say that $p \in H^*$ is differentially produced by the algebra $R$ with the augmentation $ε$ if there is a right $H$-module algebra structure on $R$ and there exists $f \in R$ satisfying $p(h) = ε(f \cdot h)$. We characterize those $p \in H^*$ which are differentially produced.
Submitted 18 July, 2020;
originally announced July 2020.
-
Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data
Authors:
Robert L. Grossman
Abstract:
Data commons collate data with cloud computing infrastructure and commonly used software services, tools and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize and share large scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. It can be quite labor intensive to curate, import and analyze the data in a data commons. Data lakes provide an alternative to data commons and simply provide access to data, with the data curation and analysis deferred until later and delegated to those that access the data. We review software platforms for managing, analyzing and sharing genomic data, with an emphasis on data commons, but also covering data ecosystems and data lakes.
Submitted 24 December, 2018; v1 submitted 5 September, 2018;
originally announced September 2018.
-
Evaluating Sentence-Level Relevance Feedback for High-Recall Information Retrieval
Authors:
Haotian Zhang,
Gordon V. Cormack,
Maura R. Grossman,
Mark D. Smucker
Abstract:
This study uses a novel simulation framework to evaluate whether the time and effort necessary to achieve high recall using active learning is reduced by presenting the reviewer with isolated sentences, as opposed to full documents, for relevance feedback. Under the weak assumption that more time and effort is required to review an entire document than a single sentence, simulation results indicate that the use of isolated sentences for relevance feedback can yield comparable accuracy and higher efficiency, relative to the state-of-the-art Baseline Model Implementation (BMI) of the AutoTAR Continuous Active Learning ("CAL") method employed in the TREC 2015 and 2016 Total Recall Track.
Submitted 27 March, 2019; v1 submitted 23 March, 2018;
originally announced March 2018.
-
Impact of Feature Selection on Micro-Text Classification
Authors:
Ankit Vadehra,
Maura R. Grossman,
Gordon V. Cormack
Abstract:
Social media datasets, especially Twitter tweets, are popular in the field of text classification. Tweets are a valuable source of micro-text (sometimes referred to as "micro-blogs"), and have been studied in domains such as sentiment analysis, recommendation systems, spam detection, clustering, among others. Tweets often include keywords referred to as "Hashtags" that can be used as labels for the tweet. Using tweets encompassing 50 labels, we studied the impact of word versus character-level feature selection and extraction on different learners to solve a multi-class classification task. We show that feature extraction of simple character-level groups performs better than simple word groups and pre-processing methods like normalizing using Porter's Stemming and Part-of-Speech ("POS")-Lemmatization.
Submitted 27 August, 2017;
originally announced August 2017.
-
Detecting Spatial Patterns of Disease in Large Collections of Electronic Medical Records Using Neighbor-Based Bootstrapping (NB2)
Authors:
Maria T Patterson,
Robert L Grossman
Abstract:
We introduce a method called neighbor-based bootstrapping (NB2) that can be used to quantify the geospatial variation of a variable. We applied this method to an analysis of the incidence rates of disease from electronic medical record data (ICD-9 codes) for approximately 100 million individuals in the US over a period of 8 years. We considered the incidence rate of disease in each county and its geospatially contiguous neighbors and rank ordered diseases in terms of their degree of geospatial variation as quantified by the NB2 method.
We show that this method yields results in good agreement with established methods for detecting spatial autocorrelation (Moran's I method and kriging). Moreover, the NB2 method can be tuned to identify both large area and small area geospatial variations. This method also applies more generally in any parameter space that can be partitioned to consist of regions and their neighbors.
Submitted 5 March, 2017;
originally announced March 2017.
-
A Case for Data Commons: Towards Data Science as a Service
Authors:
Robert L. Grossman,
Allison Heath,
Mark Murphy,
Maria Patterson,
Walt Wells
Abstract:
As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science lifecycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and re-analysis, as new data are added and as scientific pipelines are refined. We describe our experience developing data commons--interoperable infrastructure that co-locates data, storage, and compute with common analysis tools--and present several case studies. Across these case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay for compute capabilities, and data peering agreements between data commons. Though many challenges, including sustainability and developing appropriate standards, remain, interoperable data commons bring us one step closer to effective Data Science as a Service for the scientific research community.
Submitted 9 April, 2016;
originally announced April 2016.
-
The Matsu Wheel: A Cloud-based Framework for Efficient Analysis and Reanalysis of Earth Satellite Imagery
Authors:
Maria T Patterson,
Nikolas Anderson,
Collin Bennett,
Jacob Bruggemann,
Robert Grossman,
Matthew Handy,
Vuong Ly,
Dan Mandl,
Shane Pederson,
Jim Pivarski,
Ray Powell,
Jonathan Spring,
Walt Wells
Abstract:
Project Matsu is a collaboration between the Open Commons Consortium and NASA focused on developing open source technology for the cloud-based processing of Earth satellite imagery. A particular focus is the development of applications for detecting fires and floods to help support natural disaster detection and relief. Project Matsu has developed an open source cloud-based infrastructure to process, analyze, and reanalyze large collections of hyperspectral satellite image data using OpenStack, Hadoop, MapReduce, Storm and related technologies.
We describe a framework for efficient analysis of large amounts of data called the Matsu "Wheel." The Matsu Wheel is currently used to process incoming hyperspectral satellite data produced daily by NASA's Earth Observing-1 (EO-1) satellite. The framework is designed to be able to support scanning queries using cloud computing applications, such as Hadoop and Accumulo. A scanning query processes all, or most of the data, in a database or data repository.
We also describe our preliminary Wheel analytics, including an anomaly detector for rare spectral signatures or thermal anomalies in hyperspectral data and a land cover classifier that can be used for water and flood detection. Each of these analytics can generate visual reports accessible via the web for the public and interested decision makers. The resultant products of the analytics are also made accessible through an Open Geospatial Consortium (OGC)-compliant Web Map Service (WMS) for further distribution. The Matsu Wheel allows many shared data services to be performed together to efficiently use resources for processing hyperspectral satellite image data and other data, e.g., large environmental datasets that may be analyzed for many purposes.
Submitted 22 February, 2016;
originally announced February 2016.
-
The Design of a Community Science Cloud: The Open Science Data Cloud Perspective
Authors:
Robert L. Grossman,
Matthew Greenway,
Allison P. Heath,
Ray Powell,
Rafael D. Suarez,
Walt Wells,
Kevin White,
Malcolm Atkinson,
Iraklis Klampanos,
Heidi L. Alvarez,
Christine Harvey,
Joe J. Mambretti
Abstract:
In this paper we describe the design and implementation of the Open Science Data Cloud, or OSDC. The goal of the OSDC is to provide petabyte-scale data cloud infrastructure and related services for scientists working with large quantities of data. Currently, the OSDC consists of more than 2000 cores and 2 PB of storage distributed across four data centers connected by 10G networks. We discuss some of the lessons learned during the past three years of operation and describe the software stacks used in the OSDC. We also describe some of the research projects in biology, the earth sciences, and social sciences enabled by the OSDC.
Submitted 3 January, 2016;
originally announced January 2016.
-
Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review
Authors:
Gordon V. Cormack,
Maura R. Grossman
Abstract:
We enhance the autonomy of the continuous active learning method shown by Cormack and Grossman (SIGIR 2014) to be effective for technology-assisted review, in which documents from a collection are retrieved and reviewed, using relevance feedback, until substantially all of the relevant documents have been reviewed. Autonomy is enhanced through the elimination of topic-specific and dataset-specific tuning parameters, so that the sole input required by the user is, at the outset, a short query, topic description, or single relevant document; and, throughout the review, ongoing relevance assessments of the retrieved documents. We show that our enhancements consistently yield superior results to Cormack and Grossman's version of continuous active learning, and other methods, not only on average, but on the vast majority of topics from four separate sets of tasks: the legal datasets examined by Cormack and Grossman, the Reuters RCV1-v2 subject categories, the TREC 6 AdHoc task, and the construction of the TREC 2002 filtering test collection.
Submitted 26 April, 2015;
originally announced April 2015.
-
Faster computation of adiabatic EMRIs using resonances
Authors:
Rebecca Grossman,
Janna Levin,
Gabe Perez-Giz
Abstract:
Motivated by the prohibitive computational cost of producing adiabatic extreme mass ratio inspirals, we explain how a judicious use of resonant orbits can dramatically expedite both that calculation and the generation of snapshot gravitational waves from geodesic sources. In the course of our argument, we clarify the resolution of a lingering debate on the appropriate adiabatic averaging prescription in favor of torus averaging over time averaging.
Submitted 8 August, 2011;
originally announced August 2011.
-
The harmonic structure of generic Kerr orbits
Authors:
Rebecca Grossman,
Janna Levin,
Gabe Perez-Giz
Abstract:
Generic Kerr orbits exhibit intricate three-dimensional motion. We offer a classification scheme for these intricate orbits in terms of periodic orbits. The crucial insight is that for a given effective angular momentum $L$ and angle of inclination $ι$, there exists a discrete set of orbits that are geometrically $n$-leaf clovers in a precessing {\it orbital plane}. When viewed in the full three dimensions, these orbits are periodic in $r-θ$. Each $n$-leaf clover is associated with a rational number, $1+q_{rθ}=ω_θ/ω_r$, that measures the degree of perihelion precession in the precessing orbital plane. The rational number $q_{rθ}$ varies monotonically with the orbital energy and with the orbital eccentricity. Since any bound orbit can be approximated as near one of these periodic $n$-leaf clovers, this special set offers a skeleton that illuminates the structure of all bound Kerr orbits, in or out of the equatorial plane.
Submitted 29 May, 2011;
originally announced May 2011.
-
MalStone: Towards A Benchmark for Analytics on Large Data Clouds
Authors:
Collin Bennett,
Robert L. Grossman,
David Locke,
Jonathan Seidman,
Steve Vejcik
Abstract:
Developing data mining algorithms that are suitable for cloud computing platforms is currently an active area of research, as is developing cloud computing platforms appropriate for data mining. Currently, the most common benchmark for cloud computing is the Terasort (and related) benchmarks. Although the Terasort Benchmark is quite useful, it was not designed for data mining per se. In this paper, we introduce a benchmark called MalStone that is specifically designed to measure the performance of cloud computing middleware that supports the type of data intensive computing common when building data mining models. We also introduce MalGen, which is a utility for generating data on clouds that can be used with MalStone.
Submitted 7 July, 2010;
originally announced July 2010.
-
The Open Cloud Testbed: A Wide Area Testbed for Cloud Computing Utilizing High Performance Network Services
Authors:
Robert Grossman,
Yunhong Gu,
Michal Sabala,
Collin Bennet,
Jonathan Seidman,
Joe Mambratti
Abstract:
Recently, a number of cloud platforms and services have been developed for data intensive computing, including Hadoop, Sector, CloudStore (formerly KFS), HBase, and Thrift. In order to benchmark the performance of these systems, to investigate their interoperability, and to experiment with new services based on flexible compute node and network provisioning capabilities, we have designed and implemented a large scale testbed called the Open Cloud Testbed (OCT). Currently the OCT has 120 nodes in four data centers: Baltimore, Chicago (two locations), and San Diego. In contrast to other cloud testbeds, which are in small geographic areas and which are based on commodity Internet services, the OCT is a wide area testbed and the four data centers are connected with a high performance 10Gb/s network, based on a foundation of dedicated lightpaths. This testbed can address the requirements of extremely large data streams that challenge other types of distributed infrastructure. We have also developed several utilities to support the development of cloud computing systems and services, including novel node and network provisioning services, a monitoring system, and an RPC system. In this paper, we describe the OCT architecture and monitoring system. We also describe some benchmarks that we developed and some interoperability studies we performed using these benchmarks.
Submitted 27 July, 2009;
originally announced July 2009.
-
State Space Realization Theorems For Data Mining
Authors:
Robert L Grossman,
Richard G Larson
Abstract:
In this paper, we consider formal series associated with events, profiles derived from events, and statistical models that make predictions about events. We prove theorems about realizations for these formal series using the language and tools of Hopf algebras.
Submitted 18 January, 2009;
originally announced January 2009.
-
Dynamics of Black Hole Pairs II: Spherical Orbits and the Homoclinic Limit of Zoom-Whirliness
Authors:
Rebecca Grossman,
Janna Levin
Abstract:
Spinning black hole pairs exhibit a range of complicated dynamical behaviors. An interest in eccentric and zoom-whirl orbits has ironically inspired the focus of this paper: the constant radius orbits. When black hole spins are misaligned, the constant radius orbits are not circles but rather lie on the surface of a sphere and have acquired the name "spherical orbits". The spherical orbits are significant as they energetically frame the distribution of all orbits. In addition, each unstable spherical orbit is asymptotically approached by an orbit that whirls an infinite number of times, known as a homoclinic orbit. A homoclinic trajectory is an infinite whirl limit of the zoom-whirl spectrum and has a further significance as the separatrix between inspiral and plunge for eccentric orbits. We work in the context of two spinning black holes of comparable mass as described in the 3PN Hamiltonian with spin-orbit coupling included. As such, the results could provide a testing ground of the accuracy of the PN expansion. Further, the spherical orbits could provide useful initial data for numerical relativity. Finally, we comment that the spinning black hole pairs should give way to chaos around the homoclinic orbit when spin-spin coupling is incorporated.
Submitted 23 November, 2008;
originally announced November 2008.
-
Sector and Sphere: Towards Simplified Storage and Processing of Large Scale Distributed Data
Authors:
Yunhong Gu,
Robert L Grossman
Abstract:
Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. In contrast to existing storage and compute clouds, Sector can manage data not only within a data center, but also across geographically distributed data centers. Similarly, the Sphere compute cloud supports User Defined Functions (UDF) over data both within a data center and across data centers. As a special case, MapReduce style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe some experimental studies comparing Sector/Sphere and Hadoop using the Terasort Benchmark. In these studies, Sector is about twice as fast as Hadoop. Sector/Sphere is open source.
Submitted 16 January, 2009; v1 submitted 6 September, 2008;
originally announced September 2008.
-
Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
Authors:
Robert L Grossman,
Yunhong Gu
Abstract:
We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.
Submitted 21 August, 2008;
originally announced August 2008.
-
Compute and Storage Clouds Using Wide Area High Performance Networks
Authors:
Robert L. Grossman,
Yunhong Gu,
Michael Sabala,
Wanzhi Zhang
Abstract:
We describe a cloud based infrastructure that we have developed that is optimized for wide area, high performance networks and designed to support data mining applications. The infrastructure consists of a storage cloud called Sector and a compute cloud called Sphere. We describe two applications that we have built using the cloud and some experimental studies.
Submitted 13 August, 2008;
originally announced August 2008.
-
Hopf-algebraic structures of families of trees
Authors:
R. L. Grossman,
R. G. Larson
Abstract:
Description of cocommutative Hopf algebras associated with families of trees. Applications include Cayley's theorem on the number of rooted trees with n nodes, and Catalan's theorem on the number of rooted ordered trees with n nodes.
Submitted 24 November, 2007;
originally announced November 2007.
-
An Overview of Hopf Algebras of Trees and Their Actions on Functions
Authors:
Robert L. Grossman,
Richard G. Larson
Abstract:
We provide an expository account of some of the Hopf algebras that can be defined using trees, labeled trees, ordered trees and heap ordered trees. We also describe some actions of these Hopf algebras on algebras of functions.
Submitted 24 November, 2007;
originally announced November 2007.
-
Hopf Algebras of Heap Ordered Trees and Permutations
Authors:
R. L. Grossman,
R. G. Larson
Abstract:
It is known that there is a Hopf algebra structure on the vector space with basis all heap-ordered trees. We give a new bialgebra structure on the space with basis all permutations and show that there is a direct bialgebra isomorphism between the Hopf algebra of heap-ordered trees and the bialgebra of permutations.
Submitted 14 November, 2007; v1 submitted 9 June, 2007;
originally announced June 2007.
-
Benefits of Artificially Generated Gravity Gradients for Interferometric Gravitational-Wave Detectors
Authors:
L. Matone,
P. Raffai,
S. Marka,
R. Grossman,
P. Kalmus,
Z. Marka,
J. Rollins,
V. Sannibale
Abstract:
We present an approach to experimentally evaluate gravity gradient noise, a potentially limiting noise source in advanced interferometric gravitational wave (GW) detectors. In addition, the method can be used to provide sub-percent calibration in phase and amplitude of modern interferometric GW detectors. Knowledge of calibration to such certainties shall enhance the scientific output of the instruments in case of an eventual detection of GWs. The method relies on a rotating symmetrical two-body mass, a Dynamic gravity Field Generator (DFG). The placement of the DFG in the proximity of one of the interferometer's suspended test masses generates a change in the local gravitational field detectable with current interferometric GW detectors.
Submitted 24 January, 2007;
originally announced January 2007.
-
Differential Algebra Structures on Familes of Trees
Authors:
Robert L Grossman,
Richard G Larson
Abstract:
It is known that the vector space spanned by labeled rooted trees forms a Hopf algebra. Let k be a field and let R be a commutative k-algebra. Let H denote the Hopf algebra of rooted trees labeled using derivations D in Der(R). In this paper, we introduce a construction which gives R an H-module algebra structure and show this induces a differential algebra structure of H acting on R. The work here extends the notion of an R/k-bialgebra introduced by Nichols and Weisfeiler.
Submitted 31 August, 2004;
originally announced September 2004.
-
Thermal and Non-thermal Plasmas in the Galaxy Cluster 3C 129
Authors:
H. Krawczynski,
D. E. Harris,
R. Grossman,
W. Lane,
N. Kassim,
A. G. Willis
Abstract:
We describe new Chandra spectroscopy data of the cluster which harbors the prototypical "head tail" radio galaxy 3C 129 and the weaker radio galaxy 3C 129.1. We combined the Chandra data with Very Large Array (VLA) radio data taken at 0.33, 5, and 8 GHz (archival data) and 1.4 GHz (new data). We also obtained new HI observations at the Dominion Radio Astrophysical Observatory (DRAO) to measure the neutral Hydrogen column density in the direction of the cluster with arcminute angular resolution. The Chandra observation reveals extended X-ray emission from the radio galaxy 3C 129.1 with a total luminosity of 1.5E+41 erg/s. The X-ray excess is resolved into an extended central source of ~2 arcsec (1 kpc) diameter and several point sources with an individual luminosity up to 2.1E+40 erg/s. In the case of the radio galaxy 3C 129, the Chandra observation shows, in addition to core and jet X-ray emission reported in an earlier paper, some evidence for extended, diffuse X-ray emission from a region east of the radio core. The 12 arcsec x 36 arcsec (6 kpc x 17 kpc) region lies "in front" of the radio core, in the same direction into which the radio galaxy is moving. We use the radio and X-ray data to study in detail the pressure balance between the non-thermal radio plasma and the thermal Intra Cluster Medium (ICM) along the tail of 3C 129 which extends over 15 arcmin (427 kpc). Depending on the assumed lower energy cutoff of the electron energy spectrum, the minimum pressure of the radio plasma lies a factor of between 10 and 40 below the ICM pressure for a large part of the tail. We discuss several possibilities to explain the apparent pressure mismatch.
Submitted 23 July, 2003; v1 submitted 3 February, 2003;
originally announced February 2003.