-
Forensic Data Analytics for Anomaly Detection in Evolving Networks
Authors:
Li Yang,
Abdallah Moubayed,
Abdallah Shami,
Amine Boukhtouta,
Parisa Heidari,
Stere Preda,
Richard Brunner,
Daniel Migault,
Adel Larabi
Abstract:
In the prevailing convergence of traditional infrastructure-based deployment (i.e., Telco and industry operational networks) towards evolving deployments enabled by 5G and virtualization, there is a keen interest in elaborating effective security controls to protect these deployments in-depth. By considering key enabling technologies like 5G and virtualization, evolving networks are democratized, facilitating the establishment of point presences integrating different business models ranging from media, dynamic web content, gaming, and a plethora of IoT use cases. Despite the increasing services provided by evolving networks, many cybercrimes and attacks have been launched in evolving networks to perform malicious activities. Due to the limitations of traditional security artifacts (e.g., firewalls and intrusion detection systems), the research on digital forensic data analytics has attracted more attention. Digital forensic analytics enables people to derive detailed information and comprehensive conclusions from different perspectives of cybercrimes to assist in convicting criminals and preventing future crimes. This chapter presents a digital analytics framework for network anomaly detection, including multi-perspective feature engineering, unsupervised anomaly detection, and comprehensive result correction procedures. Experiments on real-world evolving network data show the effectiveness of the proposed forensic data analytics solution.
Submitted 17 August, 2023;
originally announced August 2023.
-
Multi-Perspective Content Delivery Networks Security Framework Using Optimized Unsupervised Anomaly Detection
Authors:
Li Yang,
Abdallah Moubayed,
Abdallah Shami,
Parisa Heidari,
Amine Boukhtouta,
Adel Larabi,
Richard Brunner,
Stere Preda,
Daniel Migault
Abstract:
Content delivery networks (CDNs) provide efficient content distribution over the Internet. CDNs improve the connectivity and efficiency of global communications, but their caching mechanisms may be breached by cyber-attackers. Among the security mechanisms, effective anomaly detection forms an important part of CDN security enhancement. In this work, we propose a multi-perspective unsupervised learning framework for anomaly detection in CDNs. In the proposed framework, a multi-perspective feature engineering approach, an optimized unsupervised anomaly detection model that utilizes an isolation forest and a Gaussian mixture model, and a multi-perspective validation method, are developed to detect abnormal behaviors in CDNs mainly from the client Internet Protocol (IP) and node perspectives, therefore to identify the denial of service (DoS) and cache pollution attack (CPA) patterns. Experimental results are presented based on the analytics of eight days of real-world CDN log data provided by a major CDN operator. Through experiments, the abnormal contents, compromised nodes, malicious IPs, as well as their corresponding attack types, are identified effectively by the proposed framework and validated by multiple cybersecurity experts. This shows the effectiveness of the proposed method when applied to real-world CDN data.
Submitted 23 July, 2021;
originally announced July 2021.
-
Cost-optimal V2X Service Placement in Distributed Cloud/Edge Environment
Authors:
Abdallah Moubayed,
Abdallah Shami,
Parisa Heidari,
Adel Larabi,
Richard Brunner
Abstract:
Deploying V2X services has become a challenging task. This is mainly due to the fact that such services have strict latency requirements. To meet these requirements, one potential solution is adopting mobile edge computing (MEC). However, this presents new challenges including how to find a cost-efficient placement that meets other requirements such as latency. In this work, the problem of cost-optimal V2X service placement (CO-VSP) in a distributed cloud/edge environment is formulated. Additionally, a cost-focused delay-aware V2X service placement (DA-VSP) heuristic algorithm is proposed. Simulation results show that both the CO-VSP model and the DA-VSP algorithm guarantee the QoS requirements of all such services and illustrate the trade-off between latency and deployment cost.
Submitted 14 October, 2020;
originally announced October 2020.
-
The 1st Agriculture-Vision Challenge: Methods and Results
Authors:
Mang Tik Chiu,
Xingqian Xu,
Kai Wang,
Jennifer Hobbs,
Naira Hovakimyan,
Thomas S. Huang,
Honghui Shi,
Yunchao Wei,
Zilong Huang,
Alexander Schwing,
Robert Brunner,
Ivan Dozier,
Wyatt Dozier,
Karen Ghandilyan,
David Wilson,
Hyunseong Park,
Junhee Kim,
Sungho Kim,
Qinghui Liu,
Michael C. Kampffmeyer,
Robert Jenssen,
Arnt B. Salberg,
Alexandre Barbosa,
Rodrigo Trevisan,
Bingchen Zhao
, et al. (17 additional authors not shown)
Abstract:
The first Agriculture-Vision Challenge aims to encourage research in developing novel and effective algorithms for agricultural pattern recognition from aerial images, especially for the semantic segmentation task associated with our challenge dataset. Around 57 participating teams from various countries compete to achieve state-of-the-art in aerial agriculture semantic segmentation. The Agriculture-Vision Challenge Dataset was employed, which comprises 21,061 aerial and multi-spectral farmland images. This paper provides a summary of notable methods and results in the challenge. Our submission server and leaderboard will remain open to researchers who are interested in this challenge dataset and task; the link can be found here.
Submitted 23 April, 2020; v1 submitted 21 April, 2020;
originally announced April 2020.
-
Machine Learning for Performance-Aware Virtual Network Function Placement
Authors:
Dimitrios Michael Manias,
Manar Jammal,
Hassan Hawilo,
Abdallah Shami,
Parisa Heidari,
Adel Larabi,
Richard Brunner
Abstract:
With the growing demand for data connectivity, network service providers are faced with the task of reducing their capital and operational expenses while simultaneously improving network performance and addressing the increased connectivity demand. Although Network Function Virtualization (NFV) has been identified as a solution, several challenges must be addressed to ensure its feasibility. In this paper, we address the Virtual Network Function (VNF) placement problem by developing a machine learning decision tree model that learns from the effective placement of the various VNF instances forming a Service Function Chain (SFC). The model takes several performance-related features from the network as an input and selects the placement of the various VNF instances on network servers with the objective of minimizing the delay between dependent VNF instances. The benefits of using machine learning are realized by moving away from a complex mathematical modelling of the system and towards a data-based understanding of the system. Using the Evolved Packet Core (EPC) as a use case, we evaluate our model on different data center networks and compare it to the BACON algorithm in terms of the delay between interconnected components and the total delay across the SFC. Furthermore, a time complexity analysis is performed to show the effectiveness of the model in NFV applications.
Submitted 13 January, 2020;
originally announced January 2020.
-
Edge-enabled V2X Service Placement for Intelligent Transportation Systems
Authors:
Abdallah Moubayed,
Abdallah Shami,
Parisa Heidari,
Adel Larabi,
Richard Brunner
Abstract:
Vehicle-to-everything (V2X) communication and services have been garnering significant interest from different stakeholders as part of future intelligent transportation systems (ITSs). This is due to the many benefits they offer. However, many of these services have stringent performance requirements, particularly in terms of the delay/latency. Multi-access/mobile edge computing (MEC) has been proposed as a potential solution for such services by bringing them closer to vehicles. Yet, this introduces a new set of challenges such as where to place these V2X services, especially given the limited computation resources available at edge nodes. To that end, this work formulates the problem of optimal V2X service placement (OVSP) in a hybrid core/edge environment as a binary integer linear programming problem. To the best of our knowledge, no previous work considered the V2X service placement problem while taking into consideration the computational resource availability at the nodes. Moreover, a low-complexity greedy-based heuristic algorithm named "Greedy V2X Service Placement Algorithm" (G-VSPA) was developed to solve this problem. Simulation results show that the OVSP model successfully guarantees and maintains the QoS requirements of all the different V2X services. Additionally, it is observed that the proposed G-VSPA algorithm achieves close to optimal performance while having lower complexity.
Submitted 13 January, 2020;
originally announced January 2020.
-
Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis
Authors:
Mang Tik Chiu,
Xingqian Xu,
Yunchao Wei,
Zilong Huang,
Alexander Schwing,
Robert Brunner,
Hrant Khachatrian,
Hovnatan Karapetyan,
Ivan Dozier,
Greg Rose,
David Wilson,
Adrian Tudor,
Naira Hovakimyan,
Thomas S. Huang,
Honghui Shi
Abstract:
The success of deep learning in visual recognition tasks has driven advancements in multiple fields of research. Particularly, increasing attention has been drawn towards its application in agriculture. Nevertheless, while visual pattern recognition on farmlands carries enormous economic values, little progress has been made to merge computer vision and crop sciences due to the lack of suitable agricultural image datasets. Meanwhile, problems in agriculture also pose new challenges in computer vision. For example, semantic segmentation of aerial farmland images requires inference over extremely large-size images with extreme annotation sparsity. These challenges are not present in most of the common object datasets, and we show that they are more challenging than many other aerial image datasets. To encourage research in computer vision for agriculture, we present Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel. We annotate nine types of field anomaly patterns that are most important to farmers. As a pilot study of aerial agricultural semantic segmentation, we perform comprehensive experiments using popular semantic segmentation models; we also propose an effective model designed for aerial agricultural pattern recognition. Our experiments demonstrate several challenges Agriculture-Vision poses to both the computer vision and agriculture communities. Future versions of this dataset will include even more aerial images, anomaly patterns and image channels. More information at https://www.agriculture-vision.com.
Submitted 19 March, 2020; v1 submitted 5 January, 2020;
originally announced January 2020.
-
Unsupervised Star Galaxy Classification with Cascade Variational Auto-Encoder
Authors:
Hao Sun,
Jiadong Guo,
Edward J. Kim,
Robert J. Brunner
Abstract:
The increasing amount of data in astronomy provides great challenges for machine learning research. Previously, supervised learning methods achieved satisfactory recognition accuracy for the star-galaxy classification task, based on manually labeled data sets. In this work, we propose a novel unsupervised approach for the star-galaxy recognition task, namely Cascade Variational Auto-Encoder (CasVAE). Our empirical results show that our method outperforms the baseline model in both accuracy and stability.
Submitted 30 October, 2019;
originally announced October 2019.
-
Extended Isolation Forest
Authors:
Sahand Hariri,
Matias Carrasco Kind,
Robert J. Brunner
Abstract:
We present an extension to the model-free anomaly detection algorithm, Isolation Forest. This extension, named Extended Isolation Forest (EIF), resolves issues with assignment of anomaly score to given data points. We motivate the problem using heat maps for anomaly scores. These maps suffer from artifacts generated by the criteria for branching operation of the binary tree. We explain this problem in detail and demonstrate the mechanism by which it occurs visually. We then propose two different approaches for improving the situation. First we propose transforming the data randomly before creation of each tree, which results in averaging out the bias. Second, which is the preferred way, is to allow the slicing of the data to use hyperplanes with random slopes. This approach results in remedying the artifact seen in the anomaly score heat maps. We show that the robustness of the algorithm is much improved using this method by looking at the variance of scores of data points distributed along constant level sets. We report AUROC and AUPRC for our synthetic datasets, along with real-world benchmark datasets. We find no appreciable difference in the rate of convergence nor in computation time between the standard Isolation Forest and EIF.
Submitted 8 July, 2020; v1 submitted 5 November, 2018;
originally announced November 2018.
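The second approach described in the abstract above, branching on hyperplanes with random slopes rather than axis-parallel cuts, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_hyperplane_split` and `isolation_depth`, the Gaussian draw for the normal vector, and the uniform draw of the intercept inside the bounding box are all assumptions made for illustration.

```python
import random


def random_hyperplane_split(points, rng):
    """Partition points with one randomly oriented hyperplane (hypothetical helper)."""
    dim = len(points[0])
    # Assumption: a random slope is obtained by drawing each component of the
    # normal vector from a standard normal distribution.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Assumption: the hyperplane passes through a point drawn uniformly
    # inside the bounding box of the current data.
    lows = [min(p[i] for p in points) for i in range(dim)]
    highs = [max(p[i] for p in points) for i in range(dim)]
    intercept = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]

    left, right = [], []
    for p in points:
        # The sign of (p - intercept) . normal decides the side of the hyperplane.
        side = sum((p[i] - intercept[i]) * normal[i] for i in range(dim))
        (left if side <= 0 else right).append(p)
    return left, right


def isolation_depth(target, points, rng, max_depth=10):
    """Count splits until target is alone in its branch or max_depth is reached."""
    current = list(points)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        left, right = random_hyperplane_split(current, rng)
        # Follow whichever branch still contains the target point.
        current = left if target in left else right
        depth += 1
    return depth


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (5.0, 5.0)]
    for point in data:
        # Average the depth over many random trees, as a forest would.
        mean_depth = sum(isolation_depth(point, data, rng) for _ in range(200)) / 200
        print(point, round(mean_depth, 2))
```

In an isolation-forest style score, points that become isolated after fewer splits on average are treated as more anomalous; the sketch only computes the depth and leaves the score normalization out.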
-
Structure and Content of the Visible Darknet
Authors:
Georgia Avarikioti,
Roman Brunner,
Aggelos Kiayias,
Roger Wattenhofer,
Dionysis Zindros
Abstract:
In this paper, we analyze the topology and the content found on the "darknet", the set of websites accessible via Tor. We created a darknet spider and crawled the darknet starting from a bootstrap list by recursively following links. We explored the whole connected component of more than 34,000 hidden services, of which we found 10,000 to be online. Contrary to folklore belief, the visible part of the darknet is surprisingly well-connected through hub websites such as wikis and forums. We performed a comprehensive categorization of the content using supervised machine learning. We observe that about half of the visible dark web content is related to apparently licit activities based on our classifier. A significant amount of content pertains to software repositories, blogs, and activism-related websites. Among unlawful hidden services, most pertain to fraudulent websites, services selling counterfeit goods, and drug markets.
Submitted 7 November, 2018; v1 submitted 4 November, 2018;
originally announced November 2018.
-
An NFV and Microservice Based Architecture for On-the-fly Component Provisioning in Content Delivery Networks
Authors:
Narjes Tahghigh Jahromi,
Roch H. Glitho,
Adel Larabi,
Richard Brunner
Abstract:
Content Delivery Networks (CDNs) deliver content (e.g. Web pages, videos) to geographically distributed end-users over the Internet. Some content occasionally attracts the attention of a large group of end-users. This often leads to flash crowds, which can cause major issues such as outages in the CDN. The microservice architectural style aims at decomposing monolithic systems into smaller components which can be independently deployed, upgraded and disposed of. Network Function Virtualization (NFV) is an emerging technology that aims to reduce costs and bring agility by decoupling network functions from the underlying hardware. This paper leverages NFV and the microservice architectural style to propose an architecture for on-the-fly CDN component provisioning to tackle issues such as flash crowds. In the proposed architecture, CDN components are designed as sets of microservices which interact via RESTful Web services and are provisioned as Virtual Network Functions (VNFs), which are deployed and orchestrated on-the-fly. We have built a prototype in which a CDN surrogate server, designed as a set of microservices, is deployed on-the-fly. The prototype is deployed on SAVI, a Canadian distributed test bed for future Internet applications. The performance is also evaluated.
Submitted 13 October, 2017;
originally announced October 2017.
-
Workload Analysis of Blue Waters
Authors:
Matthew D. Jones,
Joseph P. White,
Martins Innus,
Robert L. DeLeon,
Nikolay Simakov,
Jeffrey T. Palmer,
Steven M. Gallo,
Thomas R. Furlani,
Michael Showerman,
Robert Brunner,
Andry Kot,
Gregory Bauer,
Brett Bode,
Jeremy Enos,
William Kramer
Abstract:
Blue Waters is a Petascale-level supercomputer whose mission is to enable the national scientific and research community to solve "grand challenge" problems that are orders of magnitude more complex than can be carried out on other high performance computing systems. Given the important and unique role that Blue Waters plays in the U.S. research portfolio, it is important to have a detailed understanding of its workload in order to guide performance optimization both at the software and system configuration level as well as inform architectural balance tradeoffs. Furthermore, understanding the computing requirements of the Blue Waters workload (memory access, IO, communication, etc.), which comprises some of the most computationally demanding scientific problems, will help drive changes in future computing architectures, especially at the leading edge. With this objective in mind, the project team carried out a detailed workload analysis of Blue Waters.
Submitted 2 March, 2017;
originally announced March 2017.
-
Star-galaxy Classification Using Deep Convolutional Neural Networks
Authors:
Edward J. Kim,
Robert J. Brunner
Abstract:
Most existing star-galaxy classifiers use the reduced summary information from catalogs, requiring careful feature extraction and selection. The latest advances in machine learning that use deep convolutional neural networks allow a machine to automatically learn the features directly from data, minimizing the need for input from human experts. We present a star-galaxy classification framework that uses deep convolutional neural networks (ConvNets) directly on the reduced, calibrated pixel values. Using data from the Sloan Digital Sky Survey (SDSS) and the Canada-France-Hawaii Telescope Lensing Survey (CFHTLenS), we demonstrate that ConvNets are able to produce accurate and well-calibrated probabilistic classifications that are competitive with conventional machine learning techniques. Future advances in deep learning may bring more success with current and forthcoming photometric surveys, such as the Dark Energy Survey (DES) and the Large Synoptic Survey Telescope (LSST), because deep neural networks require very little manual feature engineering.
Submitted 13 October, 2016; v1 submitted 15 August, 2016;
originally announced August 2016.
-
Teaching Data Science
Authors:
Robert J. Brunner,
Edward J. Kim
Abstract:
We describe an introductory data science course, entitled Introduction to Data Science, offered at the University of Illinois at Urbana-Champaign. The course introduced general programming concepts by using the Python programming language with an emphasis on data preparation, processing, and presentation. The course had no prerequisites, and students were not expected to have any programming experience. This introductory course was designed to cover a wide range of topics, from the nature of data, to storage, to visualization, to probability and statistical analysis, to cloud and high performance computing, without becoming overly focused on any one subject. We conclude this article with a discussion of lessons learned and our plans to develop new data science courses.
Submitted 25 April, 2016;
originally announced April 2016.
-
SOMz: photometric redshift PDFs with self organizing maps and random atlas
Authors:
M. Carrasco Kind,
R. J. Brunner
Abstract:
In this paper we explore the applicability of the unsupervised machine learning technique of Self Organizing Maps (SOM) to estimate galaxy photometric redshift probability density functions (PDFs). This technique takes a spectroscopic training set, and maps the photometric attributes, but not the redshifts, to a two dimensional surface by using a process of competitive learning where neurons compete to more closely resemble the training data multidimensional space. The key feature of a SOM is that it retains the topology of the input set, revealing correlations between the attributes that are not easily identified. We test three different 2D topological mappings: rectangular, hexagonal, and spherical, by using data from the DEEP2 survey. We also explore different implementations and boundary conditions on the map and also introduce the idea of a random atlas where a large number of different maps are created and their individual predictions are aggregated to produce a more robust photometric redshift PDF. We also introduce a new metric, the $I$-score, which efficiently incorporates different metrics, making it easier to compare different results (from different parameters or different photometric redshift codes). We find that by using a spherical topology mapping we obtain a better representation of the underlying multidimensional topology, which provides more accurate results that are comparable to other, state-of-the-art machine learning algorithms. Our results illustrate that unsupervised approaches have great potential for many astronomical problems, and in particular for the computation of photometric redshifts.
Submitted 18 December, 2013;
originally announced December 2013.
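The competitive learning process mentioned in the abstract above can be sketched as a single generic SOM training step on a rectangular, non-periodic grid. This is a textbook-style illustration, not the SOMz code: the function names, the Gaussian neighborhood function, and the `lr` and `sigma` parameters are assumptions, and the hexagonal and spherical topologies and the random atlas are not covered.

```python
import math


def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def som_update(weights, x, lr=0.5, sigma=1.0):
    """Apply one competitive-learning step in place and return the winning cell."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            # Assumption: Gaussian neighborhood measured on the 2D grid, so the
            # winner moves most and nearby neurons follow with a smaller step.
            grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
            influence = math.exp(-grid_dist_sq / (2.0 * sigma ** 2))
            row[c] = [wi + lr * influence * (xi - wi) for wi, xi in zip(w, x)]
    return br, bc


if __name__ == "__main__":
    # A 2x2 map over two hypothetical input attributes.
    grid = [[[0.0, 0.0], [1.0, 0.0]],
            [[0.0, 1.0], [1.0, 1.0]]]
    winner = som_update(grid, [0.9, 0.1])
    print(winner, grid[winner[0]][winner[1]])
```

Because neighbors on the grid are pulled toward the same input, nearby cells end up representing similar inputs, which is the topology-preserving property the abstract refers to.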
-
Bring out your codes! Bring out your codes! (Increasing Software Visibility and Re-use)
Authors:
Alice Allen,
Bruce Berriman,
Robert Brunner,
Dan Burger,
Kimberly DuPrie,
Robert J. Hanisch,
Robert Mann,
Jessica Mink,
Christer Sandin,
Keith Shortridge,
Peter Teuben
Abstract:
Progress is being made in code discoverability and preservation, but as discussed at ADASS XXI, many codes still remain hidden from public view. With the Astrophysics Source Code Library (ASCL) now indexed by the SAO/NASA Astrophysics Data System (ADS), the introduction of a new journal, Astronomy & Computing, focused on astrophysics software, and the increasing success of education efforts such as Software Carpentry and SciCoder, the community has the opportunity to set a higher standard for its science by encouraging the release of software for examination and possible reuse. We assembled representatives of the community to present issues inhibiting code release and sought suggestions for tackling these factors.
The session began with brief statements by panelists; the floor was then opened for discussion and ideas. Comments covered a diverse range of related topics and points of view, with apparent support for the propositions that algorithms should be readily available, code used to produce published scientific results should be made available, and there should be discovery mechanisms to allow these to be found easily. With increased use of resources such as GitHub (for code availability), ASCL (for code discovery), and a stated strong preference from the new journal Astronomy & Computing for code release, we expect to see additional progress over the next few years.
Submitted 9 December, 2012;
originally announced December 2012.
-
Robust Machine Learning Applied to Terascale Astronomical Datasets
Authors:
Nicholas M. Ball,
Robert J. Brunner,
Adam D. Myers
Abstract:
We present recent results from the LCDM (Laboratory for Cosmological Data Mining; http://lcdm.astro.uiuc.edu) collaboration between UIUC Astronomy and NCSA to deploy supercomputing cluster resources and machine learning algorithms for the mining of terascale astronomical datasets. This is a novel application in the field of astronomy, because we are using such resources for data mining, and not just performing simulations. Via a modified implementation of the NCSA cyberenvironment Data-to-Knowledge, we are able to provide improved classifications for over 100 million stars and galaxies in the Sloan Digital Sky Survey, improved distance measures, and a full exploitation of the simple but powerful k-nearest neighbor algorithm. A driving principle of this work is that our methods should be extensible from current terascale datasets to upcoming petascale datasets and beyond. We discuss issues encountered to-date, and further issues for the transition to petascale. In particular, disk I/O will become a major limiting factor unless the necessary infrastructure is implemented.
Submitted 21 April, 2008;
originally announced April 2008.