-
BenthicNet: A global compilation of seafloor images for deep learning applications
Authors:
Scott C. Lowe,
Benjamin Misiuk,
Isaac Xu,
Shakhboz Abdulazizov,
Amit R. Baroi,
Alex C. Bastos,
Merlin Best,
Vicki Ferrini,
Ariell Friedman,
Deborah Hart,
Ove Hoegh-Guldberg,
Daniel Ierodiaconou,
Julia Mackin-McLaughlin,
Kathryn Markey,
Pedro S. Menandro,
Jacquomo Monk,
Shreya Nemani,
John O'Brien,
Elizabeth Oh,
Luba Y. Reshitnyk,
Katleen Robert,
Chris M. Roelfsema,
Jessica A. Sameoto,
Alexandre C. G. Schimel,
Jordan A. Thomson
, et al. (4 additional authors not shown)
Abstract:
Advances in underwater imaging enable the collection of extensive seafloor image datasets that are necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering expedient mobilization of this crucial environmental information. Recent machine learning approaches provide opportunities to increase the efficiency with which seafloor image datasets are analyzed, yet large and consistent datasets necessary to support development of such approaches are scarce. Here we present BenthicNet: a global compilation of seafloor imagery designed to support the training and evaluation of large-scale image recognition models. An initial set of over 11.4 million images was collected and curated to represent a diversity of seafloor environments using a representative subset of 1.3 million images. These are accompanied by 2.6 million annotations translated to the CATAMI scheme, which span 190,000 of the images. A large deep learning model was trained on this compilation and preliminary results suggest it has utility for automating large and small-scale image analysis tasks. The compilation and model are made openly available for use by the scientific community at https://doi.org/10.20383/103.0614.
Submitted 8 May, 2024;
originally announced May 2024.
-
A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networks
Authors:
Axel Constant,
Hannes Westermann,
Bryan Wilson,
Alex Kiefer,
Ines Hipolito,
Sylvain Pronovost,
Steven Swanson,
Mahault Albarracin,
Maxwell J. D. Ramstead
Abstract:
Legal autonomy - the lawful activity of artificial intelligence agents - can be achieved in one of two ways. It can be achieved either by imposing constraints on AI actors such as developers, deployers and users, and on AI resources such as data, or by imposing constraints on the range and scope of the impact that AI agents can have on the environment. The latter approach involves encoding extant rules concerning AI driven devices into the software of AI agents controlling those devices (e.g., encoding rules about limitations on zones of operations into the agent software of an autonomous drone device). This is a challenge since the effectivity of such an approach requires a method of extracting, loading, transforming and computing legal information that would be both explainable and legally interoperable, and that would enable AI agents to reason about the law. In this paper, we sketch a proof of principle for such a method using large language models (LLMs), expert legal systems known as legal decision paths, and Bayesian networks. We then show how the proposed method could be applied to extant regulation in matters of autonomous cars, such as the California Vehicle Code.
Submitted 27 March, 2024;
originally announced March 2024.
-
FurNav: Development and Preliminary Study of a Robot Direction Giver
Authors:
Bruce W. Wilson,
Yann Schlosser,
Rayane Tarkany,
Meriam Moujahid,
Birthe Nesset,
Tanvi Dinkar,
Verena Rieser
Abstract:
When giving directions to a lost-looking tourist, would you first reference the street-names, cardinal directions, landmarks, or simply tell them to walk five hundred metres in one direction then turn left? Depending on the circumstances, one could reasonably make use of any of these direction giving styles. However, research on direction giving with a robot does not often look at how these different direction styles impact perceptions of the robot's intelligence, nor does it take into account how users' prior dispositions may impact ratings. In this work, we look at generating natural language for two navigation styles using a created system for a Furhat robot, before measuring perceived intelligence and animacy alongside users' prior dispositions to robots in a small preliminary study (N=7). Our results confirm findings by previous work that prior negative attitudes towards robots correlate negatively with propensity to trust robots, and also suggest avenues for future research. For example, more data is needed to explore the link between perceived intelligence and direction style. We end by discussing our plan to run a larger scale experiment, and how to improve our existing study design.
Submitted 25 September, 2023;
originally announced September 2023.
-
Feeding the Coffee Habit: A Longitudinal Study of a Robo-Barista
Authors:
Mei Yii Lim,
David A. Robb,
Bruce W. Wilson,
Helen Hastie
Abstract:
Studying Human-Robot Interaction over time can provide insights into what really happens when a robot becomes part of people's everyday lives. "In the Wild" studies inform the design of social robots, such as for the service industry, to enable them to remain engaging and useful beyond the novelty effect and initial adoption. This paper presents an "In the Wild" experiment where we explored the evolution of interaction between users and a Robo-Barista. We show that perceived trust and prior attitudes are both important factors associated with the usefulness, adaptability and likeability of the Robo-Barista. A combination of interaction features and user attributes is used to predict user satisfaction. Qualitative insights illuminate users' Robo-Barista experience and contribute to a number of lessons learned for future long-term studies.
Submitted 6 September, 2023;
originally announced September 2023.
-
An Empirical Analysis of Range for 3D Object Detection
Authors:
Neehar Peri,
Mengtian Li,
Benjamin Wilson,
Yu-Xiong Wang,
James Hays,
Deva Ramanan
Abstract:
LiDAR-based 3D detection plays a vital role in autonomous navigation. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. However, AVs must detect far-field objects for safe navigation. In this paper, we present an empirical analysis of far-field 3D detection using the long-range detection dataset Argoverse 2.0 to better understand the problem, and share the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and are better encoded with large voxels. We exploit this observation to build a collection of range experts tuned for near-vs-far field detection, and propose simple techniques to efficiently ensemble models for long-range detection that improve efficiency by 33% and boost accuracy by 3.2% CDS.
Submitted 8 August, 2023;
originally announced August 2023.
-
We are all Individuals: The Role of Robot Personality and Human Traits in Trustworthy Interaction
Authors:
Mei Yii Lim,
José David Aguas Lopes,
David A. Robb,
Bruce W. Wilson,
Meriam Moujahid,
Emanuele De Pellegrin,
Helen Hastie
Abstract:
As robots take on roles in our society, it is important that their appearance, behaviour and personality are appropriate for the job they are given and are perceived favourably by the people with whom they interact. Here, we provide an extensive quantitative and qualitative study exploring robot personality but, importantly, with respect to individual human traits. Firstly, we show that we can accurately portray personality in a social robot, in terms of extroversion-introversion using vocal cues and linguistic features. Secondly, through garnering preferences and trust ratings for these different robot personalities, we establish that, for a Robo-Barista, an extrovert robot is preferred and trusted more than an introvert robot, regardless of the subject's own personality. Thirdly, we find that individual attitudes and predispositions towards robots do impact trust in the Robo-Baristas, and are therefore important considerations in addition to robot personality, roles and interaction context when designing any human-robot interaction study.
Submitted 28 July, 2023;
originally announced July 2023.
-
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Authors:
Benjamin Wilson,
William Qi,
Tanmay Agarwal,
John Lambert,
Jagjeet Singh,
Siddhesh Khandelwal,
Bowen Pan,
Ratnesh Kumar,
Andrew Hartnett,
Jhony Kaesemodel Pontes,
Deva Ramanan,
Peter Carr,
James Hays
Abstract:
We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
Submitted 1 January, 2023;
originally announced January 2023.
-
Shallow Water Bathymetry Survey using an Autonomous Surface Vehicle
Authors:
Bibin Wilson,
Anand Singh,
Amit Sethi
Abstract:
Accurate and cost effective mapping of water bodies has an enormous significance for environmental understanding and navigation. However, the quantity and quality of information we acquire from such environmental features is limited by various factors, including cost, time, security, and the capabilities of existing data collection techniques. Measurement of water depth is an important part of such mapping, particularly in shallow locations that could provide navigational risk or have important ecological functions. Erosion and deposition at these locations, for example, due to storms and erosion, can cause rapid changes that require repeated measurements. In this paper, we describe a low-cost, resilient, unmanned autonomous surface vehicle for bathymetry data collection using side-scan sonar. We discuss the adaptation of equipment and sensors for the collection of navigation, control, and bathymetry data and also give an overview of the vehicle setup. This autonomous surface vehicle has been used to collect bathymetry from the Powai Lake in Mumbai, India.
Submitted 18 July, 2022;
originally announced July 2022.
-
Deriving Surface Resistivity from Polarimetric SAR Data Using Dual-Input UNet
Authors:
Bibin Wilson,
Rajiv Kumar,
Narayanarao Bhogapurapu,
Anand Singh,
Amit Sethi
Abstract:
Traditional survey methods for finding surface resistivity are time-consuming and labor intensive. Very few studies have focused on finding the resistivity/conductivity using remote sensing data and deep learning techniques. In this line of work, we assessed the correlation between surface resistivity and Synthetic Aperture Radar (SAR) by applying various deep learning methods and tested our hypothesis in the Coso Geothermal Area, USA. For detecting the resistivity, L-band full polarimetric SAR data acquired by UAVSAR were used, and MT (Magnetotellurics) inverted resistivity data of the area were used as the ground truth. We conducted experiments to compare various deep learning architectures and suggest the use of Dual Input UNet (DI-UNet) architecture. DI-UNet uses a deep learning architecture to predict the resistivity using full polarimetric SAR data by promising a quick survey addition to the traditional method. Our proposed approach accomplished improved outcomes for the mapping of MT resistivity from SAR data.
Submitted 5 July, 2022;
originally announced July 2022.
-
A Relative Church-Turing-Deutsch Thesis from Special Relativity and Undecidability
Authors:
Blake Wilson,
Ethan Dickey,
Vaishnavi Iyer,
Sabre Kais
Abstract:
Beginning with Turing's seminal work in 1950, artificial intelligence proposes that consciousness can be simulated by a Turing machine. This implies a potential theory of everything where the universe is a simulation on a computer, which begs the question of whether we can prove we exist in a simulation. In this work, we construct a relative model of computation where a computable \textit{local} machine is simulated by a \textit{global}, classical Turing machine. We show that the problem of the local machine computing \textbf{simulation properties} of its global simulator is undecidable in the same sense as the Halting problem. Then, we show that computing the time, space, or error accumulated by the global simulator are simulation properties and therefore are undecidable. These simulation properties give rise to special relativistic effects in the relative model which we use to construct a relative Church-Turing-Deutsch thesis where a global, classical Turing machine computes quantum mechanics for a local machine with the same constant-time local computational complexity as experienced in our universe.
Submitted 13 June, 2022;
originally announced June 2022.
-
Mars Terrain Segmentation with Less Labels
Authors:
Edwin Goh,
Jingdao Chen,
Brian Wilson
Abstract:
Planetary rover systems need to perform terrain segmentation to identify drivable areas as well as identify specific types of soil for sample collection. The latest Martian terrain segmentation methods rely on supervised learning which is very data hungry and difficult to train where only a small number of labeled samples are available. Moreover, the semantic classes are defined differently for different applications (e.g., rover traversal vs. geological) and as a result the network has to be trained from scratch each time, which is an inefficient use of resources. This research proposes a semi-supervised learning framework for Mars terrain segmentation where a deep segmentation network trained in an unsupervised manner on unlabeled images is transferred to the task of terrain segmentation trained on few labeled images. The network incorporates a backbone module which is trained using a contrastive loss function and an output atrous convolution module which is trained using a pixel-wise cross-entropy loss function. Evaluation results using the metric of segmentation accuracy show that the proposed method with contrastive pretraining outperforms plain supervised learning by 2%-10%. Moreover, the proposed model is able to achieve a segmentation accuracy of 91.1% using only 161 training images (1% of the original dataset) compared to 81.9% with plain supervised learning.
Submitted 1 February, 2022;
originally announced February 2022.
-
Promises and Pitfalls of a New Early Warning System for Gentrification in Buffalo, NY
Authors:
Jan Voltaire Vergara,
Maria Y. Rodriguez,
Ehren Dohler,
Jonathan Phillips,
Melissa Villodas,
Amy Blank Wilson,
Kenneth Joseph
Abstract:
Gentrification and its resultant displacement are one of the many "wicked problems" of social policy. The study of gentrification and displacement spans half a century, concerns a variety of spatial, temporal, and social contexts, and describes socio-political processes across the globe and throughout history. One current iteration of this field of inquiry are efforts to identify "early indicators" of gentrification and/or displacement, or the creation of "early warning systems" (EWS). The current work adds to scholarship on the utility of developing an EWS by examining the methodological considerations required for such systems to serve a justice-oriented preventative role.
Submitted 29 November, 2021;
originally announced November 2021.
-
Planning for Package Deliveries in Risky Environments Over Multiple Epochs
Authors:
Blake Wilson,
Jeffrey Hudack,
Shreyas Sundaram
Abstract:
We study a risk-aware robot planning problem where a dispatcher must construct a package delivery plan that maximizes the expected reward for a robot delivering packages across multiple epochs. Each package has an associated reward for delivery and a risk of failure. If the robot fails while delivering a package, no future packages can be delivered and the cost of replacing the robot is incurred. The package delivery plan takes place over the course of either a finite or an infinite number of epochs, denoted as the finite horizon problem and infinite horizon problem, respectively. The dispatcher has to weigh the risk and reward of delivering packages during any given epoch against the potential loss of any future epoch's reward. By using the ratio between a package's reward and its risk of failure, we prove an optimal, greedy solution to both the infinite and finite horizon problems. The finite horizon problem can be solved optimally in $O(K n\log n)$ time where $K$ is the number of epochs and $n$ is the number of packages. We show an isomorphism between the infinite horizon problem and Markov Decision Processes to prove an optimal $O(n)$ time algorithm for the infinite horizon problem.
Submitted 19 October, 2021;
originally announced October 2021.
-
Identification of Metallic Objects using Spectral Magnetic Polarizability Tensor Signatures: Object Classification
Authors:
B. A. Wilson,
P. D. Ledger,
W. R. B. Lionheart
Abstract:
The early detection of terrorist threat objects, such as guns and knives, through improved metal detection, has the potential to reduce the number of attacks and improve public safety and security. To achieve this, there is considerable potential to use the fields applied and measured by a metal detector to discriminate between different shapes and different metals since, hidden within the field perturbation, is object characterisation information. The magnetic polarizability tensor (MPT) offers an economical characterisation of metallic objects and its spectral signature provides additional object characterisation information. The MPT spectral signature can be determined from measurements of the induced voltage over a range of frequencies in a metal signature for a hidden object. With classification in mind, it can also be computed in advance for different threat and non-threat objects. In the article, we evaluate the performance of probabilistic and non-probabilistic machine learning algorithms, trained using a dictionary of computed MPT spectral signatures, to classify objects for metal detection. We discuss the importance of using appropriate features and selecting an appropriate algorithm depending on the classification problem being solved and we present numerical results for a range of practically motivated metal detection classification problems.
Submitted 13 October, 2021;
originally announced October 2021.
-
The Legislative Recipe: Syntax for Machine-Readable Legislation
Authors:
Megan Ma,
Bryan Wilson
Abstract:
Legal interpretation is a linguistic venture. In judicial opinions, for example, courts are often asked to interpret the text of statutes and legislation. As time has shown, this is not always as easy as it sounds. Matters can hinge on vague or inconsistent language and, under the surface, human biases can impact the decision-making of judges. This raises an important question: what if there was a method of extracting the meaning of statutes consistently? That is, what if it were possible to use machines to encode legislation in a mathematically precise form that would permit clearer responses to legal questions? This article attempts to unpack the notion of machine-readability, providing an overview of both its historical and recent developments. The paper will reflect on logic syntax and symbolic language to assess the capacity and limits of representing legal knowledge. In doing so, the paper seeks to move beyond existing literature to discuss the implications of various approaches to machine-readable legislation. Importantly, this study hopes to highlight the challenges encountered in this burgeoning ecosystem of machine-readable legislation against existing human-readable counterparts.
Submitted 19 August, 2021;
originally announced August 2021.
-
NukeLM: Pre-Trained and Fine-Tuned Language Models for the Nuclear and Energy Domains
Authors:
Lee Burke,
Karl Pazdernik,
Daniel Fortin,
Benjamin Wilson,
Rustam Goychayev,
John Mattingly
Abstract:
Natural language processing (NLP) tasks (text classification, named entity recognition, etc.) have seen revolutionary improvements over the last few years. This is due to language models such as BERT that achieve deep knowledge transfer by using a large pre-trained model, then fine-tuning the model on specific tasks. The BERT architecture has shown even better performance on domain-specific tasks when the model is pre-trained using domain-relevant texts. Inspired by these recent advancements, we have developed NukeLM, a nuclear-domain language model pre-trained on 1.5 million abstracts from the U.S. Department of Energy Office of Scientific and Technical Information (OSTI) database. This NukeLM model is then fine-tuned for the classification of research articles into either binary classes (related to the nuclear fuel cycle [NFC] or not) or multiple categories related to the subject of the article. We show that continued pre-training of a BERT-style architecture prior to fine-tuning yields greater performance on both article classification tasks. This information is critical for properly triaging manuscripts, a necessary task for better understanding citation networks that publish in the nuclear space, and for uncovering new areas of research in the nuclear (or nuclear-relevant) domains.
Submitted 25 May, 2021;
originally announced May 2021.
-
Learning phylogenetic trees as hyperbolic point configurations
Authors:
Benjamin Wilson
Abstract:
We propose a novel method for the inference of phylogenetic trees that utilises point configurations on hyperbolic space as its optimisation landscape. Each taxon corresponds to a point of the point configuration, while the evolutionary distance between taxa is represented by the geodesic distance between their corresponding points. The point configuration is iteratively modified to increase an objective function that additively combines pairwise log-likelihood terms. After convergence, the final tree is derived from the inter-point distances using a standard distance-based method. The objective function, which is shown to mimic the log-likelihood on tree space, is a differentiable function on a Riemannian manifold. Thus gradient-based optimisation techniques can be applied, avoiding the need for combinatorial rearrangements of tree topology.
Submitted 4 June, 2021; v1 submitted 23 April, 2021;
originally announced April 2021.
-
Scheduling the NASA Deep Space Network with Deep Reinforcement Learning
Authors:
Edwin Goh,
Hamsa Shwetha Venkataram,
Mark Hoffmann,
Mark Johnston,
Brian Wilson
Abstract:
With three complexes spread evenly across the Earth, NASA's Deep Space Network (DSN) is the primary means of communications as well as a significant scientific instrument for dozens of active missions around the world. A rapidly rising number of spacecraft and increasingly complex scientific instruments with higher bandwidth requirements have resulted in demand that exceeds the network's capacity across its 12 antennae. The existing DSN scheduling process operates on a rolling weekly basis and is time-consuming; for a given week, generation of the final baseline schedule of spacecraft tracking passes takes roughly 5 months from the initial requirements submission deadline, with several weeks of peer-to-peer negotiations in between. This paper proposes a deep reinforcement learning (RL) approach to generate candidate DSN schedules from mission requests and spacecraft ephemeris data with demonstrated capability to address real-world operational constraints. A deep RL agent is developed that takes mission requests for a given week as input, and interacts with a DSN scheduling environment to allocate tracks such that its reward signal is maximized. A comparison is made between an agent trained using Proximal Policy Optimization and its random, untrained counterpart. The results represent a proof-of-concept that, given a well-shaped reward signal, a deep RL agent can learn the complex heuristics used by experts to schedule the DSN. A trained agent can potentially be used to generate candidate schedules to bootstrap the scheduling process and thus reduce the turnaround cycle for DSN scheduling.
Submitted 9 February, 2021;
originally announced February 2021.
-
Nine Best Practices for Research Software Registries and Repositories: A Concise Guide
Authors:
Task Force on Best Practices for Software Registries,
Alain Monteil,
Alejandra Gonzalez-Beltran,
Alexandros Ioannidis,
Alice Allen,
Allen Lee,
Anita Bandrowski,
Bruce E. Wilson,
Bryce Mecum,
Cai Fan Du,
Carly Robinson,
Daniel Garijo,
Daniel S. Katz,
David Long,
Genevieve Milliken,
Hervé Ménager,
Jessica Hausman,
Jurriaan H. Spaaks,
Katrina Fenlon,
Kristin Vanderbilt,
Lorraine Hwang,
Lynn Davis,
Martin Fenner,
Michael R. Crusoe
, et al. (8 additional authors not shown)
Abstract:
Scientific software registries and repositories serve various roles in their respective disciplines. These resources improve software discoverability and research transparency, provide information for software citations, and foster preservation of computational methods that might otherwise be lost over time, thereby supporting research reproducibility and replicability. However, developing these resources takes effort, and few guidelines are available to help prospective creators of registries and repositories. To address this need, we present a set of nine best practices that can help managers define the scope, practices, and rules that govern individual registries and repositories. These best practices were distilled from the experiences of the creators of existing resources, convened by a Task Force of the FORCE11 Software Citation Implementation Working Group during the years 2019-2020. We believe that putting in place specific policies such as those presented here will help scientific software registries and repositories better serve their users and their disciplines.
Submitted 24 December, 2020;
originally announced December 2020.
-
Identification of Metallic Objects using Spectral MPT Signatures: Object Characterisation and Invariants
Authors:
P. D. Ledger,
B. A. Wilson,
A. A. S. Amad,
W. R. B. Lionheart
Abstract:
The early detection of terrorist threats, such as guns and knives, through improved metal detection, has the potential to reduce the number of attacks and improve public safety and security. To achieve this, there is considerable potential to use the fields applied and measured by a metal detector to discriminate between different shapes and different metals since, hidden within the field perturbation, is object characterisation information. The magnetic polarizability tensor (MPT) offers an economical characterisation of metallic objects that can be computed for different threat and non-threat objects and has an established theoretical background, which shows that the induced voltage is a function of the hidden object's MPT coefficients. In this paper, we describe the additional characterisation information that measurements of the induced voltage over a range of frequencies offer compared to measurements at a single frequency. We call such a characterisation the object's MPT spectral signature. Then, we present a series of alternative rotational invariants for the purpose of classifying hidden objects using MPT spectral signatures. Finally, we include examples of computed MPT spectral signature characterisations of realistic threat and non-threat objects that can be used to train machine learning algorithms for classification purposes.
Submitted 18 December, 2020;
originally announced December 2020.
-
Group isomorphism is nearly-linear time for most orders
Authors:
Heiko Dietrich,
James B. Wilson
Abstract:
We show that there is a dense set $\ourset\subseteq \mathbb{N}$ of group orders and a constant $c$ such that for every $n\in \ourset$ we can decide in time $O(n^2(\log n)^c)$ whether two $n\times n$ multiplication tables describe isomorphic groups of order $n$. This improves significantly over the general $n^{O(\log n)}$-time complexity and shows that group isomorphism can be tested efficiently for almost all group orders $n$. We also show that in time $O(n^2(\log n)^c)$ it can be decided whether an $n\times n$ multiplication table describes a group; this improves over the known $O(n^3)$ complexity. Our complexities are calculated for a deterministic multi-tape Turing machine model. We give the implications to a RAM model in the promise hierarchy as well.
Submitted 10 April, 2021; v1 submitted 5 November, 2020;
originally announced November 2020.
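For orientation, the $O(n^3)$ baseline that the abstract above improves on can be sketched as a direct check of the group axioms on a multiplication table. This is a minimal illustration of the naive test only, not the authors' nearly-linear method; the function name `is_group_table` and the convention that elements are labelled `0..n-1` are assumptions made for this sketch.

```python
from itertools import product


def is_group_table(table):
    """Naive check that an n x n table over labels 0..n-1 is a group table.

    Hypothetical helper for illustration: table[a][b] is the product a*b.
    The associativity loop dominates the cost at O(n^3) table lookups.
    """
    n = len(table)
    if n == 0 or any(len(row) != n for row in table):
        return False
    elems = range(n)

    # Closure: every entry must be a valid element label.
    if any(not (0 <= table[a][b] < n) for a, b in product(elems, repeat=2)):
        return False

    # Identity: some e with e*a == a == a*e for all a.
    identity = next(
        (e for e in elems
         if all(table[e][a] == a and table[a][e] == a for a in elems)),
        None,
    )
    if identity is None:
        return False

    # Inverses: every a needs some b with a*b == b*a == identity.
    for a in elems:
        if not any(table[a][b] == identity and table[b][a] == identity
                   for b in elems):
            return False

    # Associativity: (a*b)*c == a*(b*c) for all triples.
    return all(
        table[table[a][b]][c] == table[a][table[b][c]]
        for a, b, c in product(elems, repeat=3)
    )


# Usage: addition mod 3 is a group; subtraction mod 3 is not.
z3 = [[(a + b) % 3 for b in range(3)] for a in range(3)]
sub3 = [[(a - b) % 3 for b in range(3)] for a in range(3)]
print(is_group_table(z3), is_group_table(sub3))
```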
-
Identifying Entangled Physics Relationships through Sparse Matrix Decomposition to Inform Plasma Fusion Design
Authors:
M. Giselle Fernández-Godino,
Michael J. Grosskopf,
Julia B. Nakhleh,
Brandon M. Wilson,
John Kline,
Gowri Srinivasan
Abstract:
A sustainable burn platform through inertial confinement fusion (ICF) has been an ongoing challenge for over 50 years. Mitigating engineering limitations and improving the current design involves an understanding of the complex coupling of physical processes. While sophisticated simulation codes are used to model ICF implosions, these tools contain necessary numerical approximations but miss physical processes that limit predictive capability. Identification of relationships between controllable design inputs to ICF experiments and measurable outcomes (e.g. yield, shape) from performed experiments can help guide the future design of experiments and development of simulation codes, to potentially improve the accuracy of the computational models used to simulate ICF experiments. We use sparse matrix decomposition methods to identify clusters of a few related design variables. Sparse principal component analysis (SPCA) identifies groupings that are related to the physical origin of the variables (laser, hohlraum, and capsule). A variable importance analysis finds that in addition to variables highly correlated with neutron yield such as picket power and laser energy, variables that represent a dramatic change of the ICF design such as number of pulse steps are also very important. The obtained sparse components are then used to train a random forest (RF) surrogate for predicting total yield. The RF performance on the training and testing data compares with the performance of the RF surrogate trained using all design variables considered. This work is intended to inform design changes in future ICF experiments by augmenting the expert intuition and simulation results.
Submitted 28 October, 2020;
originally announced October 2020.
-
Exploring Sensitivity of ICF Outputs to Design Parameters in Experiments Using Machine Learning
Authors:
Julia B. Nakhleh,
M. Giselle Fernández-Godino,
Michael J. Grosskopf,
Brandon M. Wilson,
John Kline,
Gowri Srinivasan
Abstract:
Building a sustainable burn platform in inertial confinement fusion (ICF) requires an understanding of the complex coupling of physical processes and the effects that key experimental design changes have on implosion performance. While simulation codes are used to model ICF implosions, incomplete physics and the need for approximations deteriorate their predictive capability. Identification of relationships between controllable design inputs and measurable outcomes can help guide the future design of experiments and development of simulation codes, which can potentially improve the accuracy of the computational models used to simulate ICF implosions. In this paper, we leverage developments in machine learning (ML) and methods for ML feature importance/sensitivity analysis to identify complex relationships in ways that are difficult to process using expert judgment alone. We present work using random forest (RF) regression for prediction of yield, velocity, and other experimental outcomes given a suite of design parameters, along with an assessment of important relationships and uncertainties in the prediction model. We show that RF models are capable of learning and predicting on ICF experimental data with high accuracy, and we extract feature importance metrics that provide insight into the physical significance of different controllable design inputs for various ICF design configurations. These results can be used to augment expert intuition and simulation results for optimal design of future ICF experiments.
Submitted 1 September, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Bounds on Sweep-Covers by Raney Numbers
Authors:
Blake Wilson
Abstract:
In this work, we introduce a vertex separator in trees known as a sweep-cover that is defined by an ancestor-descendent relationship with all nodes in the tree. We prove the recurrence relation of sweep-covers with $n$ subcovers $P_{Δ, γ}(n)$ on a class of infinite $Δ$-ary trees with constant path lengths $γ$ between the $Δ$-star internal nodes. Then, we provide recurrence relations for Raney numbers over integer compositions and show that they provide a lower-bound for sweep-covers such that $P_{Δ, γ}(n) = Ω\left( \frac{\sqrt{2 π} n^{Δn + Δ+ \frac{3}{2}}}{e^n ((Δ-1)n+Δ+1)!(n+1)!} γ\right)$.
Submitted 12 January, 2022; v1 submitted 17 September, 2020;
originally announced September 2020.
-
3D for Free: Crossmodal Transfer Learning using HD Maps
Authors:
Benjamin Wilson,
Zsolt Kira,
James Hays
Abstract:
3D object detection is a core perceptual challenge for robotics and autonomous driving. However, the class-taxonomies in modern autonomous driving datasets are significantly smaller than many influential 2D detection datasets. In this work, we address the long-tail problem by leveraging both the large class-taxonomies of modern 2D datasets and the robustness of state-of-the-art 2D detection methods. We proceed to mine a large, unlabeled dataset of images and LiDAR, and estimate 3D object bounding cuboids, seeded from an off-the-shelf 2D instance segmentation model. Critically, we constrain this ill-posed 2D-to-3D mapping by using high-definition maps and object size priors. The result of the mining process is 3D cuboids with varying confidence. This mining process is itself a 3D object detector, although not especially accurate when evaluated as such. However, when we then train a 3D object detection model on these cuboids, we find, consistent with other recent observations in the deep learning literature, that the resulting model is fairly robust to the noisy supervision that our mining process provides. We mine a collection of 1151 unlabeled, multimodal driving logs from an autonomous vehicle and use the discovered objects to train a LiDAR-based object detector. We show that detector performance increases as we mine more unlabeled data. With our full, unlabeled dataset, our method performs competitively with fully supervised methods, even exceeding the performance for certain object categories, without any human 3D annotations.
Submitted 24 August, 2020;
originally announced August 2020.
-
COVID-19 Contact-Tracing Mobile Apps: Evaluation and Assessment for Decision Makers
Authors:
Ramesh Raskar,
Greg Nadeau,
John Werner,
Rachel Barbar,
Ashley Mehra,
Gabriel Harp,
Markus Leopoldseder,
Bryan Wilson,
Derrick Flakoll,
Praneeth Vepakomma,
Deepti Pahwa,
Robson Beaudry,
Emelin Flores,
Maciej Popielarz,
Akanksha Bhatia,
Andrea Nuzzo,
Matt Gee,
Jay Summet,
Rajeev Surati,
Bikram Khastgir,
Francesco Maria Benedetti,
Kristen Vilcans,
Sienna Leis,
Khahlil Louisy
Abstract:
A number of groups, from governments to non-profits, have quickly acted to innovate the contact-tracing process: they are designing, building, and launching contact-tracing apps in response to the COVID-19 crisis. A diverse range of approaches exists, creating challenging choices for officials looking to implement contact-tracing technology in their community and raising concerns about these choices among citizens asked to participate in contact tracing. We are frequently asked how to evaluate and differentiate between the options for contact-tracing applications. Here, we share the questions we ask about app features and plans when reviewing the many contact-tracing apps appearing on the global stage.
Submitted 3 June, 2020;
originally announced June 2020.
-
EigenRank by Committee: A Data Subset Selection and Failure Prediction paradigm for Robust Deep Learning based Medical Image Segmentation
Authors:
Bilwaj Gaonkar,
Joel Beckett,
Mark Attiah,
Christine Ahn,
Matthew Edwards,
Bayard Wilson,
Azim Laiwalla,
Banafsheh Salehi,
Bryan Yoo,
Alex Bui,
Luke Macyszyn
Abstract:
Translation of fully automated deep learning based medical image segmentation technologies to clinical workflows faces two main algorithmic challenges. The first is the collection and archival of large quantities of manually annotated ground truth data for both training and validation. The second is the relative inability of the majority of deep learning based segmentation techniques to alert physicians to a likely segmentation failure. Here we propose a novel algorithm, named 'Eigenrank', which addresses both of these challenges. Eigenrank can select, for manual labeling, a subset of medical images from a large database such that a U-Net trained on this subset is superior to one trained on a randomly selected subset of the same size. Eigenrank can also be used to pick out cases in a large database where deep learning segmentation will fail. We present our algorithm, followed by results and a discussion of how Eigenrank exploits the Von Neumann information to perform both data subset selection and failure prediction for medical image segmentation using deep learning.
Submitted 18 January, 2021; v1 submitted 17 August, 2019;
originally announced August 2019.
-
Fully-automated patient-level malaria assessment on field-prepared thin blood film microscopy images, including Supplementary Information
Authors:
Charles B. Delahunt,
Mayoore S. Jaiswal,
Matthew P. Horning,
Samantha Janko,
Clay M. Thompson,
Sourabh Kulhare,
Liming Hu,
Travis Ostbye,
Grace Yun,
Roman Gebrehiwot,
Benjamin K. Wilson,
Earl Long,
Stephane Proux,
Dionicia Gamboa,
Peter Chiodini,
Jane Carter,
Mehul Dhorda,
David Isaboke,
Bernhards Ogutu,
Wellington Oyibo,
Elizabeth Villasis,
Kyaw Myo Tun,
Christine Bachman,
David Bell,
Courosh Mehanian
Abstract:
Malaria is a life-threatening disease affecting millions. Microscopy-based assessment of thin blood films is a standard method to (i) determine malaria species and (ii) quantitate high-parasitemia infections. Full automation of malaria microscopy by machine learning (ML) is a challenging task because field-prepared slides vary widely in quality and presentation, and artifacts often heavily outnumber relatively rare parasites. In this work, we describe a complete, fully-automated framework for thin film malaria analysis that applies ML methods, including convolutional neural nets (CNNs), trained on a large and diverse dataset of field-prepared thin blood films. Quantitation and species identification results are close to sufficiently accurate for the concrete needs of drug resistance monitoring and clinical use-cases on field-prepared samples. We focus our methods and our performance metrics on the field use-case requirements. We discuss key issues and important metrics for the application of ML methods to malaria microscopy.
Submitted 11 September, 2022; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Incorporating Weisfeiler-Leman into algorithms for group isomorphism
Authors:
Peter A. Brooksbank,
Joshua A. Grochow,
Yinan Li,
Youming Qiao,
James B. Wilson
Abstract:
In this paper we combine many of the standard and more recent algebraic techniques for testing isomorphism of finite groups (GpI) with combinatorial techniques that have typically been applied to Graph Isomorphism. In particular, we show how to combine several state-of-the-art GpI algorithms for specific group classes into an algorithm for general GpI, namely: composition series isomorphism (Rosenbaum-Wagner, Theoret. Comp. Sci., 2015; Luks, 2015), recursively-refineable filters (Wilson, J. Group Theory, 2013), and low-genus GpI (Brooksbank-Maglione-Wilson, J. Algebra, 2017). Recursively-refineable filters -- a generalization of subgroup series -- form the skeleton of this framework, and we refine our filter by building a hypergraph encoding low-genus quotients, to which we then apply a hypergraph variant of the k-dimensional Weisfeiler-Leman technique. Our technique is flexible enough to readily incorporate additional hypergraph invariants or additional characteristic subgroups.
Submitted 6 May, 2019;
originally announced May 2019.
-
Opportunity costs in the game of best choice
Authors:
Madeline Crews,
Brant Jones,
Kaitlyn Myers,
Laura Taalman,
Michael Urbanski,
Breeann Wilson
Abstract:
The game of best choice, also known as the secretary problem, is a model for sequential decision making with many variations in the literature. Notably, the classical setup assumes that the sequence of candidate rankings is uniformly distributed over time and that there is no expense associated with the candidate interviews. Here, we weight each ranking permutation according to the position of the best candidate in order to model costs incurred from conducting interviews with candidates that are ultimately not hired. We compare our weighted model with the classical (uniform) model via a limiting process. It turns out that imposing even infinitesimal costs on the interviews results in a probability of success that is about 28%, as opposed to 1/e (about 37%) in the classical case.
Submitted 12 March, 2019; v1 submitted 5 March, 2019;
originally announced March 2019.
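The classical 1/e figure quoted in the abstract above can be reproduced exactly for finite n. The sketch below covers only the classical (uniform, cost-free) model, not the weighted model the paper introduces; the function names are assumptions made for this illustration. It uses the standard threshold strategy: reject the first r candidates, then accept the first candidate who beats everyone seen so far.

```python
from fractions import Fraction


def success_probability(n, r):
    """Exact chance of hiring the best of n candidates with threshold r.

    Classical model only: all n! ranking orders are equally likely and
    interviews are free. With r = 0 the first candidate is always hired.
    """
    if r == 0:
        return Fraction(1, n)
    # The best candidate sits at position k with probability 1/n and is
    # reached only if the best of the first k-1 lies among the first r.
    return Fraction(r, n) * sum(Fraction(1, k - 1) for k in range(r + 1, n + 1))


def best_threshold(n):
    """Threshold r in 0..n-1 that maximises the success probability."""
    return max(range(n), key=lambda r: success_probability(n, r))


# Usage: the optimum approaches 1/e (about 0.368) as n grows.
for n in (3, 10, 100):
    r = best_threshold(n)
    print(n, r, float(success_probability(n, r)))
```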
-
Predictive Inequity in Object Detection
Authors:
Benjamin Wilson,
Judy Hoffman,
Jamie Morgenstern
Abstract:
In this work, we investigate whether state-of-the-art object detection systems have equitable predictive performance on pedestrians with different skin tones. This work is motivated by many recent examples of ML and vision systems displaying higher error rates for certain demographic groups than others. We annotate an existing large scale dataset which contains pedestrians, BDD100K, with Fitzpatrick skin tones in ranges [1-3] or [4-6]. We then provide an in-depth comparative analysis of performance between these two skin tone groupings, finding that neither time of day nor occlusion explains this behavior, suggesting this disparity is not merely the result of pedestrians in the 4-6 range appearing in more difficult scenes for detection. We investigate to what extent time of day, occlusion, and reweighting the supervised loss during training affect this predictive bias.
Submitted 21 February, 2019;
originally announced February 2019.
-
Skip-gram word embeddings in hyperbolic space
Authors:
Matthias Leimeister,
Benjamin J. Wilson
Abstract:
Recent work has demonstrated that embeddings of tree-like graphs in hyperbolic space surpass their Euclidean counterparts in performance by a large margin. Inspired by these results and scale-free structure in the word co-occurrence graph, we present an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance is derived and included in the skip-gram negative-sampling architecture of word2vec. The hyperbolic word embeddings are then evaluated on word similarity and analogy benchmarks. The results demonstrate the potential of hyperbolic word embeddings, particularly in low dimensions, though without clear superiority over their Euclidean counterparts. We further discuss subtleties in the formulation of the analogy task in curved spaces.
Submitted 27 May, 2019; v1 submitted 30 August, 2018;
originally announced September 2018.
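As a concrete reference for the "hyperbolic distance" mentioned in the abstract above, the sketch below computes distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model the authors use, so the choice of model and the function name `poincare_distance` are assumptions made for this illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Assumed model: the Poincare ball, where
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    def squared_norm(x):
        return sum(c * c for c in x)

    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Usage: distances grow rapidly as a point approaches the boundary.
origin = [0.0, 0.0]
for radius in (0.5, 0.9, 0.99):
    print(radius, poincare_distance(origin, [radius, 0.0]))
```

The rapid growth of distance near the boundary is what lets tree-like structures spread out in few dimensions, which is the property the abstract's low-dimensional results rely on.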
-
MARVIN: An Open Machine Learning Corpus and Environment for Automated Machine Learning Primitive Annotation and Execution
Authors:
Chris A. Mattmann,
Sujen Shah,
Brian Wilson
Abstract:
In this demo paper, we introduce the DARPA D3M program for automatic machine learning (ML) and JPL's MARVIN tool that provides an environment to locate, annotate, and execute machine learning primitives for use in ML pipelines. MARVIN is a web-based application and associated back-end interface written in Python that enables composition of ML pipelines from hundreds of primitives from the world of Scikit-Learn, Keras, DL4J and other widely used libraries. MARVIN allows for the creation of Docker containers that run on Kubernetes clusters within DARPA to provide an execution environment for automated machine learning. MARVIN currently contains over 400 datasets and challenge problems from a wide array of ML domains, ranging from routine classification and regression to advanced video/image classification and remote sensing.
Submitted 11 August, 2018;
originally announced August 2018.
-
Controlled Experiments for Word Embeddings
Authors:
Benjamin J. Wilson,
Adriaan M. J. Schakel
Abstract:
An experimental approach to studying the properties of word embeddings is proposed. Controlled experiments, achieved through modifications of the training corpus, permit the demonstration of direct relations between word properties and word vector direction and length. The approach is demonstrated using the word2vec CBOW model with experiments that independently vary word frequency and word co-occurrence noise. The experiments reveal that word vector length depends more or less linearly on both word frequency and the level of noise in the co-occurrence distribution of the word. The coefficients of linearity depend upon the word. The special point in feature space, defined by the (artificial) word with pure noise in its co-occurrence distribution, is found to be small but non-zero.
Submitted 14 December, 2015; v1 submitted 9 October, 2015;
originally announced October 2015.
-
Measuring Word Significance using Distributed Representations of Words
Authors:
Adriaan M. J. Schakel,
Benjamin J. Wilson
Abstract:
Distributed representations of words as real-valued vectors in a relatively low-dimensional space aim at extracting syntactic and semantic features from large text corpora. A recently introduced neural network, named word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b), was shown to encode semantic information in the direction of the word vectors. In this brief report, it is proposed to use the length of the vectors, together with the term frequency, as a measure of word significance in a corpus. Experimental evidence using a domain-specific corpus of abstracts is presented to support this proposal. A useful visualization technique for text corpora emerges, where words are mapped onto a two-dimensional plane and automatically ranked by significance.
Submitted 10 August, 2015;
originally announced August 2015.
-
A forward-backward single-source shortest paths algorithm
Authors:
David B. Wilson,
Uri Zwick
Abstract:
We describe a new forward-backward variant of Dijkstra's and Spira's Single-Source Shortest Paths (SSSP) algorithms. While essentially all SSSP algorithms only scan edges forward, the new algorithm scans some edges backward. The new algorithm assumes that edges in the outgoing and incoming adjacency lists of the vertices appear in non-decreasing order of weight. (Spira's algorithm makes the same assumption about the outgoing adjacency lists, but does not use incoming adjacency lists.) The running time of the algorithm on a complete directed graph on $n$ vertices with independent exponential edge weights is $O(n)$, with very high probability. This improves on the previously best result of $O(n\log n)$, which is best possible if only forward scans are allowed, exhibiting an interesting separation between forward-only and forward-backward SSSP algorithms. As a consequence, we also get a new all-pairs shortest paths algorithm. The expected running time of the algorithm on complete graphs with independent exponential edge weights is $O(n^2)$, matching a recent algorithm of Demetrescu and Italiano as analyzed by Peres et al. Furthermore, the probability that the new algorithm requires more than $O(n^2)$ time is exponentially small, improving on the $O(n^{-1/26})$ probability bound obtained by Peres et al.
Submitted 29 May, 2014;
originally announced May 2014.
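For readers unfamiliar with the forward-only baseline that the abstract above contrasts against, here is a minimal sketch of classical Dijkstra with a binary heap. It is not the forward-backward variant the paper describes; the adjacency-list format (a dict mapping each vertex to `(neighbour, weight)` pairs) is an assumption made for this illustration.

```python
import heapq


def dijkstra(graph, source):
    """Classical forward-only SSSP for non-negative edge weights.

    `graph` is assumed to map each vertex to a list of (neighbour, weight)
    pairs. Returns a dict of shortest distances to reachable vertices.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; u was already settled more cheaply
        for v, w in graph.get(u, ()):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Usage: every edge is scanned forward from its tail vertex.
example = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
print(dijkstra(example, "a"))
```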
-
The min mean-weight cycle in a random network
Authors:
Claire Mathieu,
David B. Wilson
Abstract:
The mean weight of a cycle in an edge-weighted graph is the sum of the cycle's edge weights divided by the cycle's length. We study the minimum mean-weight cycle on the complete graph on n vertices, with random i.i.d. edge weights drawn from an exponential distribution with mean 1. We show that the probability of the min mean weight being at most c/n tends to a limiting function of c which is analytic for c<=1/e, discontinuous at c=1/e, and equal to 1 for c>1/e. We further show that if the min mean weight is <=1/(en), then the length of the relevant cycle is Theta_p(1) (i.e., it has a limiting probability distribution which does not scale with n), but that if the min mean weight is >1/(en), then the relevant cycle almost always has mean weight (1+o(1))/(en) and length at least (2/pi^2-o(1)) log^2 n log log n.
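The definition of mean weight in the abstract can be made concrete with a small Python sketch. It assumes a directed complete graph stored as a weight matrix and uses brute-force enumeration, which is only feasible for very small n and is not the analysis technique of the paper; all function names are hypothetical.

```python
import itertools
import random


def random_exponential_weights(n, seed=0):
    """Complete directed graph with i.i.d. Exp(mean 1) edge weights."""
    rng = random.Random(seed)
    return [[0.0 if u == v else rng.expovariate(1.0) for v in range(n)]
            for u in range(n)]


def cycle_mean_weight(weights, cycle):
    """Sum of the cycle's edge weights divided by the cycle's length."""
    k = len(cycle)
    total = sum(weights[cycle[i]][cycle[(i + 1) % k]] for i in range(k))
    return total / k


def min_mean_cycle_bruteforce(weights):
    """Return (mean weight, cycle) minimizing the mean over all simple
    directed cycles of length >= 2. Exponential time: tiny n only."""
    n = len(weights)
    best_mean, best_cycle = float("inf"), None
    for k in range(2, n + 1):
        for perm in itertools.permutations(range(n), k):
            if perm[0] != min(perm):
                continue  # consider each cycle once, from its smallest vertex
            mean = cycle_mean_weight(weights, perm)
            if mean < best_mean:
                best_mean, best_cycle = mean, list(perm)
    return best_mean, best_cycle
```

Calling `min_mean_cycle_bruteforce(random_exponential_weights(7))` gives one sample of the minimum mean weight, which can be compared against the c/n scale discussed in the abstract.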
Submitted 5 July, 2013; v1 submitted 18 January, 2012;
originally announced January 2012.
-
Advancements in scientific data searching, sharing and retrieval
Authors:
Ranjeet Devarakonda,
Giri Palanisamy,
Bruce Wilson
Abstract:
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a standard that is seeing increased use as a means for exchanging structured metadata. OAI-PMH implementations must support Dublin Core as a metadata standard, with other metadata formats as optional. We have developed tools which enable Mercury to consume metadata from OAI-PMH services in any of the metadata formats we support (Dublin Core, Darwin Core, FGDC CSDGM, GCMD DIF, EML, and ISO 19115/19137). We are also making ORNL DAAC metadata available through OAI-PMH for other metadata tools to use. This paper describes Mercury's capabilities with multiple metadata formats in general and, more specifically, the results of our OAI-PMH implementations and the lessons learned.
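As a rough illustration of what harvesting from an OAI-PMH service involves, the Python sketch below builds a `ListRecords` request URL. The `verb`, `metadataPrefix`, and `set` parameters and the `oai_dc` prefix for Dublin Core come from the OAI-PMH protocol; the base URL and the helper name `build_list_records_url` are placeholders, and nothing here reflects how Mercury itself is implemented.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `oai_dc` is the Dublin Core prefix every OAI-PMH repository must
    support; other prefixes depend on the individual repository.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, for illustration only.
url = build_list_records_url("https://example.org/oai")
```

A harvester would fetch this URL, parse the XML response, and repeat the request with the returned resumption token until the list is exhausted.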
Submitted 29 December, 2010; v1 submitted 19 October, 2010;
originally announced October 2010.
-
Enabling Data Discovery through Virtual Internet Repositories
Authors:
Giriprakash Palanisamy,
Ranjeet Devarakonda,
Jim Green,
Bruce Wilson
Abstract:
Mercury is a federated metadata harvesting, search and retrieval tool based on both open-source software and software developed at Oak Ridge National Laboratory. It was originally developed for NASA, and the Mercury development consortium now includes funding from NASA, USGS, and DOE. A major new version of Mercury was developed during 2007. This new version provides orders-of-magnitude improvements in search speed, support for additional metadata formats, integration with Google Maps for spatial queries, and support for RSS delivery of search results, among other features. Mercury provides a single portal to information contained in disparate data management systems. It collects metadata and key data from contributing project servers distributed around the world and builds a centralized index. The Mercury search interfaces then allow users to perform simple, fielded, spatial and temporal searches across these metadata sources. This centralized repository of metadata with distributed data sources provides extremely fast search results to the user, while allowing data providers to advertise the availability of their data and maintain complete control and ownership of that data.
Submitted 19 October, 2010; v1 submitted 12 October, 2010;
originally announced October 2010.
-
Balanced Boolean functions that can be evaluated so that every input bit is unlikely to be read
Authors:
Itai Benjamini,
Oded Schramm,
David B. Wilson
Abstract:
A Boolean function of n bits is balanced if it takes the value 1 with probability 1/2. We exhibit a balanced Boolean function with a randomized evaluation procedure (with probability 0 of making a mistake) so that on uniformly random inputs, no input bit is read with probability more than Theta(n^{-1/2} sqrt{log n}). We give a balanced monotone Boolean function for which the corresponding probability is Theta(n^{-1/3} log n). We then show that for any randomized algorithm for evaluating a balanced Boolean function, when the input bits are uniformly random, there is some input bit that is read with probability at least Theta(n^{-1/2}). For balanced monotone Boolean functions, there is some input bit that is read with probability at least Theta(n^{-1/3}).
Submitted 11 October, 2004;
originally announced October 2004.