-
Mapping of Internet "Coastlines" via Large Scale Anonymized Network Source Correlations
Authors:
Hayden Jananthan,
Jeremy Kepner,
Michael Jones,
William Arcand,
David Bestor,
William Bergeron,
Chansup Byun,
Timothy Davis,
Vijay Gadepally,
Daniel Grant,
Michael Houle,
Matthew Hubbell,
Anna Klein,
Lauren Milechin,
Guillermo Morales,
Andrew Morris,
Julie Mullen,
Ritesh Patel,
Alex Pentland,
Sandeep Pisharody,
Andrew Prout,
Albert Reuther,
Antonio Rosa,
Siddharth Samsi,
Tyler Trigg
, et al. (3 additional authors not shown)
Abstract:
Expanding the scientific tools available to protect computer networks can be aided by a deeper understanding of the underlying statistical distributions of network traffic and their potential geometric interpretations. Analyses of large scale network observations provide a unique window into studying those underlying statistics. Newly developed GraphBLAS hypersparse matrices and D4M associative array technologies enable the efficient anonymized analysis of network traffic on the scale of trillions of events. This work analyzes over 100,000,000,000 anonymized packets from the largest Internet telescope (CAIDA) and over 10,000,000 anonymized sources from the largest commercial honeyfarm (GreyNoise). Neither CAIDA nor GreyNoise actively emits Internet traffic, and both provide distinct observations of unsolicited Internet traffic (primarily botnets and scanners). Analysis of these observations confirms the previously observed Cauchy-like distributions describing temporal correlations between Internet sources. The Gull lighthouse problem is a well-known geometric characterization of the standard Cauchy distribution and motivates a potential geometric interpretation for Internet observations. This work generalizes the Gull lighthouse problem to accommodate larger classes of coastlines, deriving a closed-form solution for the resulting probability distributions, stating and examining the inverse problem of identifying an appropriate coastline given a continuous probability distribution, identifying a geometric heuristic for solving this problem computationally, and applying that heuristic to examine the temporal geometry of different subsets of network observations. Application of this method to the CAIDA and GreyNoise data reveals a difference of several orders of magnitude between known benign and other traffic, which can lead to potentially novel ways to protect networks.
Submitted 30 September, 2023;
originally announced October 2023.
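The Gull lighthouse problem named in the abstract above can be illustrated with a small simulation. This is a minimal sketch of the classical straight-coastline case only, not the generalized coastlines the paper derives; the function names and the values `x0=1.0`, `d=2.0` are assumptions chosen for illustration.

```python
import math
import random


def lighthouse_flashes(x0, d, n, seed=0):
    """Simulate where n flashes land on a straight coastline (the x-axis).

    The lighthouse sits at position x0 along the coast and a distance d
    offshore. Each flash leaves at an angle drawn uniformly from
    (-pi/2, pi/2), so it lands at x0 + d * tan(angle).
    """
    rng = random.Random(seed)
    return [
        x0 + d * math.tan(rng.uniform(-math.pi / 2, math.pi / 2))
        for _ in range(n)
    ]


def cauchy_cdf(x, x0, d):
    """CDF of a Cauchy distribution with location x0 and scale d."""
    return 0.5 + math.atan((x - x0) / d) / math.pi


# Illustrative parameters (hypothetical, not taken from the paper).
hits = sorted(lighthouse_flashes(x0=1.0, d=2.0, n=100_000))

# The sample median estimates the location x0.
median = hits[len(hits) // 2]

# For a Cauchy distribution, half of the mass lies within one scale
# parameter of the location, so this fraction should be close to 0.5.
frac = sum(1 for h in hits if abs(h - 1.0) <= 2.0) / len(hits)
```

The landing positions follow a Cauchy distribution with location `x0` and scale `d`, which is why the sample median (rather than the sample mean, which does not converge for Cauchy data) is used to estimate the lighthouse position.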
-
Observation of high-energy neutrinos from the Galactic plane
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
J. M. Alameddine,
A. A. Alves Jr.,
N. M. Amin,
K. Andeen,
T. Anderson,
G. Anton,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
S. Axani,
X. Bai,
A. Balagopal V.,
S. W. Barwick,
V. Basu,
S. Baur,
R. Bay,
J. J. Beatty,
K. -H. Becker,
J. Becker Tjus
, et al. (364 additional authors not shown)
Abstract:
The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrino emission using machine learning techniques applied to ten years of data from the IceCube Neutrino Observatory. We identify neutrino emission from the Galactic plane at the 4.5$σ$ level of significance, by comparing diffuse emission models to a background-only hypothesis. The signal is consistent with modeled diffuse emission from the Galactic plane, but could also arise from a population of unresolved point sources.
Submitted 10 July, 2023;
originally announced July 2023.
-
Adversarial Machine Learning and Cybersecurity: Risks, Challenges, and Legal Implications
Authors:
Micah Musser,
Andrew Lohn,
James X. Dempsey,
Jonathan Spring,
Ram Shankar Siva Kumar,
Brenda Leong,
Christina Liaghati,
Cindy Martinez,
Crystal D. Grant,
Daniel Rohrer,
Heather Frase,
Jonathan Elliott,
John Bansemer,
Mikel Rodriguez,
Mitt Regan,
Rumman Chowdhury,
Stefan Hermanek
Abstract:
In July 2022, the Center for Security and Emerging Technology (CSET) at Georgetown University and the Program on Geopolitics, Technology, and Governance at the Stanford Cyber Policy Center convened a workshop of experts to examine the relationship between vulnerabilities in artificial intelligence systems and more traditional types of software vulnerabilities. Topics discussed included the extent to which AI vulnerabilities can be handled under standard cybersecurity processes, the barriers currently preventing the accurate sharing of information about AI vulnerabilities, legal issues associated with adversarial attacks on AI systems, and potential areas where government support could improve AI vulnerability management and mitigation.
This report is meant to accomplish two things. First, it provides a high-level discussion of AI vulnerabilities, including the ways in which they are disanalogous to other types of vulnerabilities, and the current state of affairs regarding information sharing and legal oversight of AI vulnerabilities. Second, it attempts to articulate broad recommendations as endorsed by the majority of participants at the workshop.
Submitted 23 May, 2023;
originally announced May 2023.
-
Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic (Enriquecimiento a gran escala y caracterización cibernética estadística del tráfico de red)
Authors:
Ivan Kawaminami,
Arminda Estrada,
Youssef Elsakkary,
Hayden Jananthan,
Aydın Buluç,
Tim Davis,
Daniel Grant,
Michael Jones,
Chad Meiners,
Andrew Morris,
Sandeep Pisharody,
Jeremy Kepner
Abstract:
Modern network sensors continuously produce enormous quantities of raw data that are beyond the capacity of human analysts. Cross-correlation of network sensors increases this challenge by enriching every network event with additional metadata. These large volumes of enriched network data present opportunities to statistically characterize network traffic and quickly answer a key question: "What are the primary cyber characteristics of my network data?" The Python GraphBLAS and PyD4M analysis frameworks enable anonymized statistical analysis to be performed quickly and efficiently on very large network data sets. This approach is tested using billions of anonymized network data samples from the largest Internet observatory (CAIDA Telescope) and tens of millions of anonymized records from the largest commercially available background enrichment capability (GreyNoise). The analysis confirms that most of the enriched variables follow expected heavy-tail distributions and that a large fraction of the network traffic is due to a small number of cyber activities. This information can simplify the cyber analysts' task by enabling prioritization of cyber activities based on statistical prevalence.
--
Los sensores de red modernos producen enormes cantidades de datos sin procesar que están más allá de la capacidad del análisis humano. Una correlación cruzada de sensores de red se convierte en un desafío al enriquecer cada evento de red con metadatos adicionales. Estos grandes volúmenes de datos de red enriquecidos presentan una oportunidad para caracterizar estadísticamente el tráfico de red y responder a la pregunta: "¿Cuáles son las principales características cibernéticas de mis datos de red?" Los esquemas de análisis de Python GraphBLAS y D4M permiten realizar análisis estadísticos anónimos, rápidos y eficientes en conjuntos grandes de datos de red. Este enfoque se prueba utilizando miles de millones de muestras de datos de red anónimos del observatorio de Internet más grande (Telescopio CAIDA) y decenas de millones de registros anónimos del fondo comercial con la mayor capacidad de enriquecimiento (GreyNoise). El análisis confirma que la mayoría de las variables enriquecidas siguen las distribuciones de cola pesada y que una gran fracción del tráfico de red se debe a una pequeña cantidad de actividades cibernéticas. Esta información puede simplificar la tarea de los analistas cibernéticos al permitir la priorización de las actividades cibernéticas en función de la prevalencia estadística.
Submitted 1 December, 2022; v1 submitted 7 September, 2022;
originally announced September 2022.
-
Graph Neural Networks for Low-Energy Event Classification & Reconstruction in IceCube
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
N. Aggarwal,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
J. M. Alameddine,
A. A. Alves Jr.,
N. M. Amin,
K. Andeen,
T. Anderson,
G. Anton,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
S. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
V. Basu,
R. Bay,
J. J. Beatty,
K. -H. Becker
, et al. (359 additional authors not shown)
Abstract:
IceCube, a cubic-kilometer array of optical sensors built to detect atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV, is deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole. The classification and reconstruction of events from the in-ice detectors play a central role in the analysis of data from IceCube. Reconstructing and classifying events is a challenge due to the irregular detector geometry, inhomogeneous scattering and absorption of light in the ice and, below 100 GeV, the relatively low number of signal photons produced per event. To address this challenge, it is possible to represent IceCube events as point cloud graphs and use a Graph Neural Network (GNN) as the classification and reconstruction method. The GNN is capable of distinguishing neutrino events from cosmic-ray backgrounds, classifying different neutrino event types, and reconstructing the deposited energy, direction and interaction vertex. Based on simulation, we provide a comparison in the 1-100 GeV energy range to the state-of-the-art maximum likelihood techniques used in current IceCube analyses, including the effects of known systematic uncertainties. For neutrino event classification, the GNN increases the signal efficiency by 18% at a fixed false positive rate (FPR), compared to current IceCube methods. Alternatively, the GNN offers a reduction of the FPR by more than a factor of 8 (to below half a percent) at a fixed signal efficiency. For the reconstruction of energy, direction, and interaction vertex, the resolution improves by an average of 13%-20% compared to current maximum likelihood techniques in the energy range of 1-30 GeV. The GNN, when run on a GPU, is capable of processing IceCube events at nearly double the median IceCube trigger rate of 2.7 kHz, which opens the possibility of using low energy neutrinos in online searches for transient events.
Submitted 11 October, 2022; v1 submitted 7 September, 2022;
originally announced September 2022.
-
Temporal Correlation of Internet Observatories and Outposts
Authors:
Jeremy Kepner,
Michael Jones,
Daniel Andersen,
Aydın Buluç,
Chansup Byun,
K Claffy,
Timothy Davis,
William Arcand,
Jonathan Bernays,
David Bestor,
William Bergeron,
Vijay Gadepally,
Daniel Grant,
Michael Houle,
Matthew Hubbell,
Hayden Jananthan,
Anna Klein,
Chad Meiners,
Lauren Milechin,
Andrew Morris,
Julie Mullen,
Sandeep Pisharody,
Andrew Prout,
Albert Reuther,
Antonio Rosa
, et al. (4 additional authors not shown)
Abstract:
The Internet has become a critical component of modern civilization, requiring scientific exploration akin to endeavors to understand the land, sea, air, and space environments. Understanding the baseline statistical distributions of traffic is essential to the scientific understanding of the Internet. Correlating data from different Internet observatories and outposts can be a useful tool for gaining insights into these distributions. This work compares observed sources from the largest Internet telescope (the CAIDA darknet telescope) with those from a commercial outpost (the GreyNoise honeyfarm). Neither of these locations actively emits Internet traffic, and both provide distinct observations of unsolicited Internet traffic (primarily botnets and scanners). Newly developed GraphBLAS hypersparse matrices and D4M associative array technologies enable the efficient analysis of these data on significant scales. The CAIDA sources are well approximated by a Zipf-Mandelbrot distribution. Over a 6-month period, 70% of the brightest (highest frequency) sources in the CAIDA telescope are consistently detected by coeval observations in the GreyNoise honeyfarm. This overlap drops as the sources dim (reduce frequency) and as the time difference between the observations grows. The probability of seeing a CAIDA source is proportional to the logarithm of the brightness. The temporal correlations are well described by a modified Cauchy distribution. These observations are consistent with a correlated high frequency beam of sources that drifts on a time scale of a month.
Submitted 18 March, 2022;
originally announced March 2022.
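For readers unfamiliar with the Zipf-Mandelbrot distribution mentioned in the abstract above, the following is a minimal sketch of its normalized probability mass function, p(k) proportional to 1/(k + delta)^alpha over k = 1..N. The function name and the parameter values are assumptions for illustration; they are not the fitted values from the paper.

```python
def zipf_mandelbrot_pmf(n_values, alpha, delta):
    """Return p(1), ..., p(n_values) with p(k) proportional to 1 / (k + delta) ** alpha.

    alpha controls how fast the tail decays and delta flattens the head
    of the distribution. With delta = 0 this reduces to a plain Zipf law.
    """
    weights = [1.0 / (k + delta) ** alpha for k in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters chosen only to show the heavy-tailed shape.
pmf = zipf_mandelbrot_pmf(n_values=1000, alpha=1.5, delta=2.0)
head_mass = sum(pmf[:10])  # probability carried by the 10 smallest k values
```

A small number of values of `k` carries a large share of the total probability, which is the heavy-tailed behavior the abstract refers to.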
-
Deep2Lead: A distributed deep learning application for small molecule lead optimization
Authors:
Tarun Kumar Chawdhury,
David J. Grant,
Hyun Yong Jin
Abstract:
Lead optimization is a key step in drug discovery to produce potent and selective compounds. Historically, in silico screening and structure-based small molecule design have facilitated this process. Although recent applications of deep learning to drug discovery have piloted its use in in silico lead optimization steps, real-world application is lacking due to limited tool availability. Here, we developed a single user interface application, called Deep2Lead. Our web-based application integrates VAE and DeepPurpose DTI and allows a user to quickly perform a lead optimization task with no prior programming experience.
Submitted 9 August, 2021;
originally announced August 2021.
-
A Convolutional Neural Network based Cascade Reconstruction for the IceCube Neutrino Observatory
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
M. Ahrens,
C. Alispach,
A. A. Alves Jr.,
N. M. Amin,
R. An,
K. Andeen,
T. Anderson,
I. Ansseau,
G. Anton,
C. Argüelles,
S. Axani,
X. Bai,
A. Balagopal V.,
A. Barbano,
S. W. Barwick,
B. Bastian,
V. Basu,
V. Baum,
S. Baur,
R. Bay
, et al. (343 additional authors not shown)
Abstract:
Continued improvements on existing reconstruction methods are vital to the success of high-energy physics experiments, such as the IceCube Neutrino Observatory. In IceCube, further challenges arise as the detector is situated at the geographic South Pole where computational resources are limited. However, to perform real-time analyses and to issue alerts to telescopes around the world, powerful and fast reconstruction methods are desired. Deep neural networks can be extremely powerful, and their usage is computationally inexpensive once the networks are trained. These characteristics make a deep learning-based approach an excellent candidate for the application in IceCube. A reconstruction method based on convolutional architectures and hexagonally shaped kernels is presented. The presented method is robust towards systematic uncertainties in the simulation and has been tested on experimental data. In comparison to standard reconstruction methods in IceCube, it can improve upon the reconstruction accuracy, while reducing the time necessary to run the reconstruction by two to three orders of magnitude.
Submitted 26 July, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Differential Codes on Higher Dimensional Varieties Via Grothendieck's Residue Symbol
Authors:
David Grant,
John D. Massman, III,
S. Srimathy
Abstract:
We give a new construction of linear codes over finite fields on higher dimensional varieties using Grothendieck's theory of residues. This generalizes the construction of differential codes over curves to varieties of higher dimensions.
Submitted 5 February, 2024; v1 submitted 19 September, 2020;
originally announced September 2020.
-
A Corpus for Detecting High-Context Medical Conditions in Intensive Care Patient Notes Focusing on Frequently Readmitted Patients
Authors:
Edward T. Moseley,
Joy T. Wu,
Jonathan Welt,
John Foote,
Patrick D. Tyler,
David W. Grant,
Eric T. Carlson,
Sebastian Gehrmann,
Franck Dernoncourt,
Leo Anthony Celi
Abstract:
A crucial step within secondary analysis of electronic health records (EHRs) is to identify the patient cohort under investigation. While EHRs contain medical billing codes that aim to represent the conditions and treatments patients may have, much of the information is only present in the patient notes. Therefore, it is critical to develop robust algorithms to infer patients' conditions and treatments from their written notes. In this paper, we introduce a dataset for patient phenotyping, a task that is defined as the identification of whether a patient has a given medical condition (also referred to as clinical indication or phenotype) based on their patient note. Nursing Progress Notes and Discharge Summaries from the Intensive Care Unit of a large tertiary care hospital were manually annotated for the presence of several high-context phenotypes relevant to treatment and risk of re-hospitalization. This dataset contains 1102 Discharge Summaries and 1000 Nursing Progress Notes. Each Discharge Summary and Progress Note has been annotated by at least two expert human annotators (one clinical researcher and one resident physician). Annotated phenotypes include treatment non-adherence, chronic pain, advanced/metastatic cancer, as well as 10 other phenotypes. This dataset can be utilized for academic and industrial research in medicine and computer science, particularly within the field of medical natural language processing.
Submitted 6 March, 2020;
originally announced March 2020.
-
Indicators of retention in remote digital health studies: A cross-study evaluation of 100,000 participants
Authors:
Abhishek Pratap,
Elias Chaibub Neto,
Phil Snyder,
Carl Stepnowsky,
Noémie Elhadad,
Daniel Grant,
Matthew H. Mohebbi,
Sean Mooney,
Christine Suver,
John Wilbanks,
Lara Mangravite,
Patrick Heagerty,
Pat Arean,
Larsson Omberg
Abstract:
Digital technologies such as smartphones are transforming the way scientists conduct biomedical research using real-world data. Several remotely-conducted studies have recruited thousands of participants over a span of a few months. Unfortunately, these studies are hampered by substantial participant attrition, calling into question the representativeness of the collected data, including the generalizability of findings from these studies. We report the challenges in retention and recruitment in eight remote digital health studies comprising over 100,000 participants who participated for more than 850,000 days, completing close to 3.5 million remote health evaluations. Survival modeling surfaced several factors significantly associated (P < 1e-16) with an increase in median retention time: i) clinician referral (increase of 40 days), ii) effect of compensation (22 days), iii) clinical conditions of interest to the study (7 days), and iv) older adults (4 days). Additionally, four distinct patterns of daily app usage behavior that were also associated (P < 1e-10) with participant demographics were identified. Most studies were not able to recruit a representative sample, either demographically or regionally. Combined, these findings can help inform recruitment and retention strategies to enable equitable participation of populations in future digital health research.
Submitted 2 October, 2019;
originally announced October 2019.
-
Embedded EthiCS: Integrating Ethics Broadly Across Computer Science Education
Authors:
Barbara J. Grosz,
David Gray Grant,
Kate Vredenburgh,
Jeff Behrends,
Lily Hu,
Alison Simmons,
Jim Waldo
Abstract:
Computing technologies have become pervasive in daily life, sometimes bringing unintended but harmful consequences. For students to learn to think not only about what technology they could create, but also about what technology they should create, computer science curricula must expand to include ethical reasoning about the societal value and impact of these technologies. This paper presents Embedded EthiCS, a novel approach to integrating ethics into computer science education that incorporates ethical reasoning throughout courses in the standard computer science curriculum. It thus changes existing courses rather than requiring wholly new courses. The paper describes a pilot Embedded EthiCS program that embeds philosophers teaching ethical reasoning directly into computer science courses. It discusses lessons learned and challenges to implementing such a program across different types of academic institutions.
Submitted 16 August, 2018;
originally announced August 2018.
-
Detecting Homoglyph Attacks with a Siamese Neural Network
Authors:
Jonathan Woodbridge,
Hyrum S. Anderson,
Anjum Ahuja,
Daniel Grant
Abstract:
A homoglyph (name spoofing) attack is a common technique used by adversaries to obfuscate file and domain names. This technique creates process or domain names that are visually similar to legitimate and recognized names. For instance, an attacker may create malware with the name svch0st.exe so that in a visual inspection of running processes or a directory listing, the process or file name might be mistaken as the Windows system process svchost.exe. There has been limited published research on detecting homoglyph attacks. Current approaches rely on string comparison algorithms (such as Levenshtein distance) that result in computationally heavy solutions with a high number of false positives. In addition, there is a deficiency in the number of publicly available datasets for reproducible research, with most datasets focused on phishing attacks, in which homoglyphs are not always used. This paper presents a fundamentally different solution to this problem using a Siamese convolutional neural network (CNN). Rather than leveraging similarity based on character swaps and deletions, this technique uses a learned metric on strings rendered as images: a CNN learns features that are optimized to detect visual similarity of the rendered strings. The trained model is used to convert thousands of potentially targeted process or domain names to feature vectors. These feature vectors are indexed using randomized KD-Trees to make similarity searches extremely fast with minimal computational processing. This technique shows a considerable 13% to 45% improvement over baseline techniques in terms of area under the receiver operating characteristic curve (ROC AUC). In addition, we provide both code and data to further future research.
Submitted 24 May, 2018;
originally announced May 2018.
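The string-comparison baseline that the abstract above contrasts itself with can be sketched as follows. This is a standard dynamic-programming Levenshtein distance, shown only to illustrate the baseline family; it is not the paper's Siamese CNN method, and the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


# The spoofed name from the abstract differs from the real one by a
# single substitution ("0" for "o").
distance = levenshtein("svch0st.exe", "svchost.exe")  # → 1
```

Edit distance treats every substitution equally, so swapping "o" for "0" scores the same as swapping "o" for "x" even though only the former is visually confusable. That limitation is consistent with the abstract's motivation for a learned visual similarity metric.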
-
Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization
Authors:
Nicolas Y. Masse,
Gregory D. Grant,
David J. Freedman
Abstract:
Humans and most animals can learn new tasks without forgetting old ones. However, training artificial neural networks (ANNs) on new tasks typically causes them to forget previously learned tasks. This phenomenon is the result of "catastrophic forgetting", in which training an ANN disrupts connection weights that were important for solving previous tasks, degrading task performance. Several recent studies have proposed methods to stabilize connection weights of ANNs that are deemed most important for solving a task, which helps alleviate catastrophic forgetting. Here, drawing inspiration from algorithms that are believed to be implemented in vivo, we propose a complementary method: adding a context-dependent gating signal, such that only sparse, mostly non-overlapping patterns of units are active for any one task. This method is easy to implement, requires little computational overhead, and allows ANNs to maintain high performance across large numbers of sequentially presented tasks when combined with weight stabilization. This work provides another example of how neuroscience-inspired algorithms can benefit ANN design and capability.
Submitted 3 April, 2019; v1 submitted 2 February, 2018;
originally announced February 2018.
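The gating idea described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming each task gets a fixed random binary mask over the hidden units; the function names, the 20% keep fraction, and the layer size are hypothetical and not taken from the paper, and the weight-stabilization component is omitted.

```python
import random


def make_task_gates(n_tasks, n_units, keep_fraction, seed=0):
    """Draw one fixed binary gate per task.

    Each gate keeps a random subset of hidden units active (1.0) and
    silences the rest (0.0). Because the subsets are sparse and drawn
    independently, gates for different tasks mostly do not overlap.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(keep_fraction * n_units))
    gates = []
    for _ in range(n_tasks):
        active = set(rng.sample(range(n_units), n_keep))
        gates.append([1.0 if u in active else 0.0 for u in range(n_units)])
    return gates


def apply_gate(activations, gate):
    """Multiply hidden activations elementwise by the task's gate."""
    return [a * g for a, g in zip(activations, gate)]


# Hypothetical sizes: 3 tasks, 100 hidden units, 20% of units active per task.
gates = make_task_gates(n_tasks=3, n_units=100, keep_fraction=0.2)
hidden = [1.0] * 100
gated = apply_gate(hidden, gates[0])
```

In a training loop, the gate for the current task would be applied to the hidden layer on every forward pass, so gradient updates for that task only flow through its active units.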
-
Comparing Rule-Based and Deep Learning Models for Patient Phenotyping
Authors:
Sebastian Gehrmann,
Franck Dernoncourt,
Yeran Li,
Eric T. Carlson,
Joy T. Wu,
Jonathan Welt,
John Foote Jr.,
Edward T. Moseley,
David W. Grant,
Patrick D. Tyler,
Leo Anthony Celi
Abstract:
Objective: We investigate whether deep learning techniques for natural language processing (NLP) can be used efficiently for patient phenotyping. Patient phenotyping is a classification task for determining whether a patient has a medical condition, and is a crucial part of secondary analysis of healthcare data. We assess the performance of deep learning algorithms and compare them with classical NLP approaches.
Materials and Methods: We compare convolutional neural networks (CNNs), n-gram models, and approaches based on cTAKES that extract pre-defined medical concepts from clinical notes and use them to predict patient phenotypes. The performance is tested on 10 different phenotyping tasks using 1,610 discharge summaries extracted from the MIMIC-III database.
Results: CNNs outperform other phenotyping algorithms in all 10 tasks. The average F1-score of our model is 76 (PPV of 83, and sensitivity of 71) with our model having an F1-score up to 37 points higher than alternative approaches. We additionally assess the interpretability of our model by presenting a method that extracts the most salient phrases for a particular prediction.
Conclusion: We show that NLP methods based on deep learning improve the performance of patient phenotyping. Our CNN-based algorithm automatically learns the phrases associated with each patient phenotype. As such, it reduces the annotation complexity for clinical domain experts, who are normally required to develop task-specific annotation rules and identify relevant phrases. Our method performs well in terms of both performance and interpretability, which indicates that deep learning is an effective approach to patient phenotyping based on clinicians' notes.
Submitted 25 March, 2017;
originally announced March 2017.
-
Predicting Domain Generation Algorithms with Long Short-Term Memory Networks
Authors:
Jonathan Woodbridge,
Hyrum S. Anderson,
Anjum Ahuja,
Daniel Grant
Abstract:
Various families of malware use domain generation algorithms (DGAs) to generate a large number of pseudo-random domain names to connect to a command and control (C&C) server. In order to block DGA C&C traffic, security organizations must first discover the algorithm by reverse engineering malware samples, then generate a list of domains for a given seed. The domains are then either preregistered or published in a DNS blacklist. This process is not only tedious, but can be readily circumvented by malware authors using a large number of seeds in algorithms with multivariate recurrence properties (e.g., banjori) or by using a dynamic list of seeds (e.g., bedep). Another technique to stop malware from using DGAs is to intercept DNS queries on a network and predict whether domains are DGA generated. Such a technique will alert network administrators to the presence of malware on their networks. In addition, if the predictor can also accurately predict the family of DGAs, then network administrators can also be alerted to the type of malware that is on their networks. This paper presents a DGA classifier that leverages long short-term memory (LSTM) networks to predict DGAs and their respective families without the need for a priori feature extraction. Results are significantly better than state-of-the-art techniques, providing 0.9993 area under the receiver operating characteristic curve for binary classification and a micro-averaged F1 score of 0.9906. In other terms, the LSTM technique can provide a 90% detection rate with a 1:10000 false positive (FP) rate, a twenty-fold FP improvement over comparable methods. Experiments in this paper are run on open datasets and code snippets are provided to reproduce the results.
Submitted 2 November, 2016;
originally announced November 2016.
-
The IceProd Framework: Distributed Data Processing for the IceCube Neutrino Observatory
Authors:
M. G. Aartsen,
R. Abbasi,
M. Ackermann,
J. Adams,
J. A. Aguilar,
M. Ahlers,
D. Altmann,
C. Arguelles,
J. Auffenberg,
X. Bai,
M. Baker,
S. W. Barwick,
V. Baum,
R. Bay,
J. J. Beatty,
J. Becker Tjus,
K. -H. Becker,
S. BenZvi,
P. Berghaus,
D. Berley,
E. Bernardini,
A. Bernhard,
D. Z. Besson,
G. Binder,
D. Bindig
, et al. (262 additional authors not shown)
Abstract:
IceCube is a one-gigaton instrument located at the geographic South Pole, designed to detect cosmic neutrinos, identify the particle nature of dark matter, and study high-energy neutrinos themselves. Simulation of the IceCube detector and processing of data require a significant amount of computational resources. IceProd is a distributed management system based on Python, XML-RPC and GridFTP. It is driven by a central database in order to coordinate and administer production of simulations and processing of data produced by the IceCube detector. IceProd runs as a separate layer on top of other middleware and can take advantage of a variety of computing resources, including grids and batch systems such as CREAM, Condor, and PBS. This is accomplished by a set of dedicated daemons that process job submission in a coordinated fashion through the use of middleware plugins that serve to abstract the details of job submission and job management from the framework.
Submitted 22 August, 2014; v1 submitted 22 November, 2013;
originally announced November 2013.
-
Higher genus universally decodable matrices (UDMG)
Authors:
Steve Limburg,
David Grant,
Mahesh K. Varanasi
Abstract:
We introduce the notion of Universally Decodable Matrices of Genus g (UDMG), which for g=0 reduces to the notion of Universally Decodable Matrices (UDM) introduced in [8]. A UDMG is a set of L matrices over a finite field, each with K rows, and a linear independence condition satisfied by collections of K+g columns formed from the initial segments of the matrices. We consider the mathematical structure of UDMGs and their relation to linear vector codes. We then give a construction of UDMG based on curves of genus g over the finite field, which is a natural generalization of the UDM constructed in [8]. We provide upper (and constructible lower) bounds for L in terms of K, q, g, and the number of columns of the matrices. We show there is a fundamental trade-off (Theorem 5.4) between L and g, akin to the Singleton bound for the minimal Hamming distance of linear vector codes.
Submitted 25 January, 2013;
originally announced January 2013.