-
Dead Man's PLC: Towards Viable Cyber Extortion for Operational Technology
Authors:
Richard Derbyshire,
Benjamin Green,
Charl van der Walt,
David Hutchison
Abstract:
For decades, operational technology (OT) has enjoyed the luxury of being suitably inaccessible so as to experience directly targeted cyber attacks from only the most advanced and well-resourced adversaries. However, security via obscurity cannot last forever, and indeed a shift is happening whereby less advanced adversaries are showing an appetite for targeting OT. With this shift in adversary demographics, there will likely also be a shift in attack goals, from clandestine process degradation and espionage to overt cyber extortion (Cy-X). The consensus from OT cyber security practitioners suggests that, even if encryption-based Cy-X techniques were launched against OT assets, typical recovery practices designed for engineering processes would provide adequate resilience. In response, this paper introduces Dead Man's PLC (DM-PLC), a pragmatic step towards viable OT Cy-X that acknowledges and weaponises the resilience processes typically encountered. Using only existing functionality, DM-PLC considers an entire environment as the entity under ransom, whereby all assets constantly poll one another to ensure the attack remains untampered, treating any deviations as a detonation trigger akin to a Dead Man's switch. A proof of concept of DM-PLC is implemented and evaluated on an academically peer reviewed and industry validated OT testbed to demonstrate its malicious efficacy.
Submitted 18 July, 2023;
originally announced July 2023.
-
Walking Under the Ladder Logic: PLC-VBS, a PLC Control Logic Vulnerability Discovery Tool
Authors:
Sam Maesschalck,
Alexander Staves,
Richard Derbyshire,
Benjamin Green,
David Hutchison
Abstract:
Cyber security risk assessments provide a pivotal starting point towards the understanding of existing risk exposure, through which suitable mitigation strategies can be formed. Where risk is viewed as a product of threat, vulnerability, and impact, understanding each element is of equal importance. This can be a challenge in Industrial Control System (ICS) environments, where adopted technologies are typically not only bespoke, but interact directly with the physical world. To date, existing vulnerability identification has focused on traditional vulnerability categories. While this provides risk assessors with a baseline understanding, and the ability to hypothesize on potential resulting impacts, it is high level, operating at a level of abstraction that would be viewed as incomplete within a traditional information system context. The work presented in this paper takes the understanding of ICS device vulnerabilities one step further. It offers a tool, PLC-VBS, that helps identify Programmable Logic Controller (PLC) vulnerabilities, specifically within logic used to monitor, control, and automate operational processes. PLC-VBS gives risk assessors a more coherent picture about the potential impact should the identified vulnerabilities be exploited; this applies specifically to operational process elements.
Submitted 30 January, 2023; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Future Internet Congestion Control: The Diminishing Feedback Problem
Authors:
Michael Welzl,
Peyman Teymoori,
Safiqul Islam,
David Hutchison,
Stein Gjessing
Abstract:
It is increasingly difficult for Internet congestion control mechanisms to obtain the feedback that they need. This lack of feedback can have severe performance implications, and it is bound to become worse. In the long run, the problem may only be fixable by fundamentally changing the way congestion control is done in the Internet. We substantiate this claim by looking at the evolution of the Internet's infrastructure over the past thirty years, and by examining the most common behavior of Internet traffic. Considering the goals that congestion control mechanisms are intended to address, and taking into account contextual developments in the Internet ecosystem, we arrive at conclusions and recommendations about possible future congestion control design directions. In particular, we argue that congestion control mechanisms should move away from their strict "end-to-end" adherence. This change would benefit from avoiding a "one size fits all circumstances" approach, and moving towards a more selective set of mechanisms that will result in a better performing Internet. We will also discuss how this future vision differs from today's use of Performance Enhancing Proxies (PEPs).
Submitted 14 June, 2022;
originally announced June 2022.
-
Resilience Enhancement at Edge Cloud Systems
Authors:
Jose Moura,
David Hutchison
Abstract:
It is becoming common practice to push interactive and location-based services from remote datacenters to resource-constrained edge domains. This trend creates new management challenges at the network edge, not least to ensure resilience. These challenges now need to be investigated and overcome. In this paper, we explore the use of open-source programmable asset orchestration at edge cloud systems to guarantee operational resilience and a satisfactory performance level despite system incidents such as faults, congestion, or cyber-attacks. We discuss the design and deployment of a new cross-level configurable solution, Resilient Edge Cloud Systems (RECS). Results from appropriate tests made on RECS highlight the positive effects of deploying novel service and resource management algorithms at both data and control planes of the programmable edge system to mitigate against disruptive events such as control channel issues, service overload, or link congestion. RECS offers the following benefits: i) the switch automatically selects the standalone operation mode after its disconnection from the upper-level controllers; ii) deployment of edge virtualized services is made, according to client requests; iii) the client requests are served by edge services and the related traffic is balanced among the alternative on-demand routing paths to the edge location where each service is available for its clients; iv) the TCP traffic quality is protected from unfair competitiveness of UDP flows; and v) a set of redundant controllers is orchestrated by a top-level multi-thread cluster manager, using a novel management protocol with low overhead.
Submitted 18 May, 2022;
originally announced May 2022.
-
Pika parsing: reformulating packrat parsing as a dynamic programming algorithm solves the left recursion and error recovery problems
Authors:
Luke A. D. Hutchison
Abstract:
A recursive descent parser is built from a set of mutually-recursive functions, where each function directly implements one of the nonterminals of a grammar. A packrat parser uses memoization to reduce the time complexity for recursive descent parsing from exponential to linear in the length of the input. Recursive descent parsers are extremely simple to write, but suffer from two significant problems: (i) left-recursive grammars cause the parser to get stuck in infinite recursion, and (ii) it can be difficult or impossible to optimally recover the parse state and continue parsing after a syntax error. Both problems are solved by the pika parser, a novel reformulation of packrat parsing as a dynamic programming algorithm, which requires parsing the input in reverse: bottom-up and right to left, rather than top-down and left to right. This reversed parsing order enables pika parsers to handle grammars that use either direct or indirect left recursion to achieve left associativity, simplifying grammar writing, and also enables optimal recovery from syntax errors, which is a crucial property for IDEs and compilers. Pika parsing maintains the linear-time performance characteristics of packrat parsing as a function of input length. The pika parser was benchmarked against the widely-used Parboiled2 and ANTLR4 parsing libraries. The pika parser performed significantly better than the other parsers for an expression grammar, although for a complex grammar implementing the Java language specification, a large constant performance impact was incurred per input character. Therefore, if performance is important, pika parsing is best applied to simple to moderate-sized grammars, or to very large inputs, if other parsing alternatives do not scale linearly in the length of the input. Several new insights into precedence, associativity, and left recursion are presented.
Submitted 6 July, 2020; v1 submitted 13 May, 2020;
originally announced May 2020.
-
Fogbanks: Future Dynamic Vehicular Fog Banks for Processing, Sensing and Storage in 6G
Authors:
A. A. Alahmadi,
M. O. I. Musa,
T. E. H. El-Gorashi,
J. M. H. Elmirghani,
S. Grant-Muller,
D. Hutchison,
A. Mauthe,
M. Dianati,
C. Maple,
L. Lefevre,
A. Lason
Abstract:
Fixed edge processing has become a key feature of 5G networks, while playing a key role in reducing latency, improving energy efficiency and introducing flexible compute resource utilization on-demand with added cost savings. Autonomous vehicles are expected to possess significantly more on-board processing capabilities and improved connectivity. Vehicles continue to be used for a fraction of the day, and as such there is a potential to increase processing capacity by utilizing these resources while vehicles are in short-term and long-term car parks, on roads and at road intersections. Such car parks and road segments can be transformed, through 6G networks, into vehicular fog clusters, or Fogbanks, that can provide processing, storage and sensing capabilities, making use of underutilized vehicular resources. We introduce the Fogbanks concept, outline current research efforts underway in vehicular clouds, and suggest promising directions for 6G in a world where autonomous driving will become commonplace. Moreover, we study the processing allocation problem in a cloud-based Fogbank architecture. We solve this problem using Mixed Integer Linear Programming (MILP) to minimize the total power consumption of the proposed architecture, taking into account two allocation strategies, single allocation of tasks and distributed allocation. Finally, we describe additional future directions needed to establish reliability, security, virtualisation, energy efficiency, business models and standardization.
Submitted 10 May, 2020;
originally announced May 2020.
-
Resilient Cyber-Physical Systems: Using NFV Orchestration
Authors:
Jose Moura,
David Hutchison
Abstract:
Cyber-Physical Systems (CPSs) are increasingly important in critical areas of our society such as intelligent power grids, next generation mobile devices, and smart buildings. CPS operation has characteristics including considerable heterogeneity, variable dynamics, and high complexity. These systems also have scarce resources with which to satisfy their entire load demand, which can be divided into data processing and service execution. These new characteristics of CPSs need to be managed with novel strategies to ensure their resilient operation. Towards this goal, we propose an SDN-based solution enhanced by distributed Network Function Virtualization (NFV) modules located at the top-most level of our solution architecture. These NFV agents will take orchestrated management decisions among themselves to ensure a resilient CPS configuration against threats, and an optimum operation of the CPS. For this, we study and compare two distinct incentive mechanisms to enforce cooperation among NFVs. Thus, we aim to offer novel perspectives into the management of resilient CPSs, embedding IoT devices, modeled by Game Theory (GT), using the latest software and virtualization platforms.
Submitted 1 April, 2020; v1 submitted 26 March, 2020;
originally announced March 2020.
-
Fog Computing Systems: State of the Art, Research Issues and Future Trends, with a Focus on Resilience
Authors:
Jose Moura,
David Hutchison
Abstract:
Many future innovative computing services will use Fog Computing Systems (FCS), integrated with Internet of Things (IoT) resources. These new services, built on the convergence of several distinct technologies, need to fulfil time-sensitive functions, provide variable levels of integration with their environment, and incorporate data storage, computation, communications, sensing, and control. There are, however, significant problems to be solved before such systems can be considered fit for purpose. The high heterogeneity, complexity, and dynamics of these resource-constrained systems bring new challenges to their robust and reliable operation, which implies the need for integral resilience management strategies. This paper surveys the state of the art in the relevant fields, and discusses the research issues and future trends that are emerging. We envisage future applications that have very stringent requirements, notably high-precision latency and synchronization between a large set of flows, where FCSs are key to supporting them. Thus, we hope to provide new insights into the design and management of resilient FCSs that are formed by IoT devices, edge computer servers and wireless sensor networks; these systems can be modelled using Game Theory, and flexibly programmed with the latest software and virtualization platforms.
Submitted 17 July, 2020; v1 submitted 14 August, 2019;
originally announced August 2019.
-
On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML
Authors:
Matthias Boehm,
Berthold Reinwald,
Dylan Hutchison,
Alexandre V. Evfimievski,
Prithviraj Sen
Abstract:
Many large-scale machine learning (ML) systems allow specifying custom ML algorithms by means of linear algebra programs, and then automatically generate efficient execution plans. In this context, optimization opportunities for fused operators---in terms of fused chains of basic operators---are ubiquitous. These opportunities include (1) fewer materialized intermediates, (2) fewer scans of input data, and (3) the exploitation of sparsity across chains of operators. Automatic operator fusion eliminates the need for hand-written fused operators and significantly improves performance for complex or previously unseen chains of operations. However, existing fusion heuristics struggle to find good fusion plans for complex DAGs or hybrid plans of local and distributed operations. In this paper, we introduce an optimization framework for systematically reasoning about fusion plans that considers materialization points in DAGs, sparsity exploitation, different fusion template types, as well as local and distributed operations. In detail, we contribute algorithms for (1) candidate exploration of valid fusion plans, (2) cost-based candidate selection, and (3) code generation of local and distributed operations over dense, sparse, and compressed data. Our experiments in SystemML show end-to-end performance improvements with optimized fusion plans of up to 21x compared to hand-written fused operators, with negligible optimization and code generation overhead.
Submitted 2 January, 2018;
originally announced January 2018.
-
Polystore Mathematics of Relational Algebra
Authors:
Hayden Jananthan,
Ziqi Zhou,
Vijay Gadepally,
Dylan Hutchison,
Suna Kim,
Jeremy Kepner
Abstract:
Financial transactions, internet search, and data analysis are all placing increasing demands on databases. SQL, NoSQL, and NewSQL databases have been developed to meet these demands and each offers unique benefits. SQL, NoSQL, and NewSQL databases also rely on different underlying mathematical models. Polystores seek to provide a mechanism to allow applications to transparently achieve the benefits of diverse databases while insulating applications from the details of these databases. Integrating the underlying mathematics of these diverse databases can be an important enabler for polystores as it enables effective reasoning across different databases. Associative arrays provide a common approach for the mathematics of polystores by encompassing the mathematics found in different databases: sets (SQL), graphs (NoSQL), and matrices (NewSQL). Prior work presented the SQL relational model in terms of associative arrays and identified key mathematical properties that are preserved within SQL. This work provides the rigorous mathematical definitions, lemmas, and theorems underlying these properties. Specifically, SQL Relational Algebra deals primarily with relations - multisets of tuples - and operations on and between these relations. These relations can be modeled as associative arrays by treating tuples as non-zero rows in an array. Operations in relational algebra are built as compositions of standard operations on associative arrays which mirror their matrix counterparts. These constructions provide insight into how relational algebra can be handled via array operations. As an example application, the composition of two projection operations is shown to also be a projection, and the projection of a union is shown to be equal to the union of the projections.
Submitted 3 December, 2017;
originally announced December 2017.
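The two projection identities named in the abstract above can be checked on a toy example. The sketch below is only an illustration, not the paper's associative-array construction: it assumes a relation can be stood in for by a `collections.Counter` mapping each tuple to its multiplicity, and the helper names `project` and `union` are hypothetical.

```python
from collections import Counter

def project(rel, cols):
    """Project a multiset relation onto the given column indices.

    Multiplicities of rows that collapse to the same projected tuple are added,
    which is the multiset (bag) semantics assumed for this sketch.
    """
    out = Counter()
    for row, mult in rel.items():
        out[tuple(row[c] for c in cols)] += mult
    return out

def union(r, s):
    """Bag union: multiplicities of equal tuples are summed."""
    return r + s

# Two small relations over columns (id, colour, size); values are made up.
R = Counter({(1, "red", "S"): 1, (2, "red", "M"): 2})
S = Counter({(3, "blue", "S"): 1, (1, "red", "S"): 1})

# Projection of a union equals the union of the projections.
lhs = project(union(R, S), [1])
rhs = union(project(R, [1]), project(S, [1]))

# Composing two projections is itself a projection:
# keep columns (0, 2), then keep position 1 of the result == keep column 2.
composed = project(project(R, [0, 2]), [1])
direct = project(R, [2])
```

Here `lhs == rhs` and `composed == direct` both hold for the sample data; the paper establishes the general statements through its associative-array formulation rather than by example.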
-
Distributed Triangle Counting in the Graphulo Matrix Math Library
Authors:
Dylan Hutchison
Abstract:
Triangle counting is a key algorithm for large graph analysis. The Graphulo library provides a framework for implementing graph algorithms on the Apache Accumulo distributed database. In this work we adapt two algorithms for counting triangles, one that uses the adjacency matrix and another that also uses the incidence matrix, to the Graphulo library for server-side processing inside Accumulo. Cloud-based experiments show a similar performance profile for these different approaches on the family of power law Graph500 graphs, for which data skew increasingly bottlenecks. These results motivate the design of skew-aware hybrid algorithms that we propose for future work.
Submitted 5 September, 2017; v1 submitted 20 August, 2017;
originally announced September 2017.
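For readers unfamiliar with the adjacency-matrix formulation mentioned in the abstract above, here is a minimal single-machine sketch. It is not the Graphulo or Accumulo implementation; it assumes an undirected simple graph given as an edge list, and the function name `count_triangles` is hypothetical. It relies on the standard identity that the number of triangles equals trace(A^3) / 6 for adjacency matrix A.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    # Build a symmetric adjacency structure; self-loops are ignored.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            # (A @ A)[u][v] is the number of common neighbours of u and v;
            # summing it over all entries where A[u][v] == 1 gives trace(A^3).
            total += len(nbrs & adj[v])

    # Each triangle is counted once per ordered pair of its vertices: 3 * 2 = 6.
    return total // 6

# A complete graph on four vertices, used as a small check.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_triangles = count_triangles(k4)
```

In a distributed setting the set intersection above corresponds to a sparse matrix multiply masked by the adjacency matrix, which is where the data skew discussed in the abstract would be expected to matter.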
-
D4M 3.0: Extended Database and Language Capabilities
Authors:
Lauren Milechin,
Vijay Gadepally,
Siddharth Samsi,
Jeremy Kepner,
Alexander Chen,
Dylan Hutchison
Abstract:
The D4M tool was developed to address many of today's data needs. This tool is used by hundreds of researchers to perform complex analytics on unstructured data. Over the past few years, the D4M toolbox has evolved to support connectivity with a variety of new database engines, including SciDB. D4M-Graphulo provides the ability to do graph analytics in the Apache Accumulo database. Finally, an implementation using the Julia programming language is also now available. In this article, we describe some of our latest additions to the D4M toolbox and our upcoming D4M 3.0 release. We show through benchmarking and scaling results that we can achieve fast SciDB ingest using the D4M-SciDB connector, that using Graphulo can enable graph algorithms on scales that can be memory limited, and that the Julia implementation of D4M achieves performance comparable to or exceeding that of the existing MATLAB(R) implementation.
Submitted 9 August, 2017;
originally announced August 2017.
-
Game Theory for Multi-Access Edge Computing: Survey, Use Cases, and Future Trends
Authors:
Jose Moura,
David Hutchison
Abstract:
Game Theory (GT) has been used with significant success to formulate, and either design or optimize, the operation of many representative communications and networking scenarios. The games in these scenarios involve, as usual, diverse players with conflicting goals. This paper primarily surveys the literature that has applied theoretical games to wireless networks, emphasizing use cases of upcoming Multi-Access Edge Computing (MEC). MEC is relatively new and offers cloud services at the network periphery, aiming to reduce service latency and backhaul load, and to enhance relevant operational aspects such as Quality of Experience or security. Our presentation of GT is focused on the major challenges imposed by MEC services over the wireless resources. The survey is divided into classical and evolutionary games. Then, our discussion proceeds to more specific aspects which have a considerable impact on the game usefulness, namely: rational vs. evolving strategies, cooperation among players, available game information, the way the game is played (single turn, repeated), the game model evaluation, and how the model results can be applied for both optimizing resource-constrained resources and balancing diverse trade-offs in real edge networking scenarios. Finally, we reflect on lessons learned, highlighting future trends and research directions for applying theoretical model games in upcoming MEC services, considering both network design issues and usage scenarios.
Submitted 24 February, 2019; v1 submitted 2 April, 2017;
originally announced April 2017.
-
LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation
Authors:
Dylan Hutchison,
Bill Howe,
Dan Suciu
Abstract:
Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules. We show a series of proofs that position Lara at just the right level of expressiveness for a middleware algebra: more explicit than MapReduce but more general than RA or LA. At the physical level we find that the Lara operators afford efficient implementations using a single primitive that is available in a variety of backend engines: range scans over partitioned sorted maps.
To evaluate these ideas, we implemented the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable. First we show how Lara expresses a sensor quality control task, and we measure the performance impact of optimizations Lara admits on this task. Second we show that the LaraDB implementation outperforms Accumulo's native MapReduce integration on a core task involving join and aggregation in the form of matrix multiply, especially at smaller scales that are typically a poor fit for scale-out approaches. We find that LaraDB offers a conceptually lean framework for optimizing mixed-abstraction analytics tasks, without giving up fast record-level updates and scans.
Submitted 13 April, 2017; v1 submitted 21 March, 2017;
originally announced March 2017.
-
D4M 3.0
Authors:
Lauren Milechin,
Alexander Chen,
Vijay Gadepally,
Dylan Hutchison,
Siddharth Samsi,
Jeremy Kepner
Abstract:
The D4M tool is used by hundreds of researchers to perform complex analytics on unstructured data. Over the past few years, the D4M toolbox has evolved to support connectivity with a variety of database engines, graph analytics in the Apache Accumulo database, and an implementation using the Julia programming language. In this article, we describe some of our latest additions to the D4M toolbox and our upcoming D4M 3.0 release.
Submitted 18 January, 2017;
originally announced February 2017.
-
Benchmarking the Graphulo Processing Framework
Authors:
Timothy Weale,
Vijay Gadepally,
Dylan Hutchison,
Jeremy Kepner
Abstract:
Graph algorithms have wide applicability to a variety of domains and are often used on massive datasets. Recent standardization efforts such as the GraphBLAS specify a set of key computational kernels that hardware and software developers can adhere to. Graphulo is a processing framework that enables GraphBLAS kernels in the Apache Accumulo database. In our previous work, we have demonstrated a core Graphulo operation called TableMult that performs large-scale multiplication operations of database tables. In this article, we present the results of scaling the Graphulo engine to larger problems, and its scalability when a greater number of resources is used. Specifically, we present two experiments that demonstrate that Graphulo scaling performance is linear with the number of available resources. The first experiment demonstrates cluster processing rates through Graphulo's TableMult operator on two large graphs, scaled between $2^{17}$ and $2^{19}$ vertices. The second experiment uses TableMult to extract a random set of rows from a large graph ($2^{19}$ nodes) to simulate a cued graph analytic. These benchmarking results are of relevance to Graphulo users who wish to apply Graphulo to their graph problems.
Submitted 27 September, 2016;
originally announced September 2016.
-
Julia Implementation of the Dynamic Distributed Dimensional Data Model
Authors:
Alexander Chen,
Alan Edelman,
Jeremy Kepner,
Vijay Gadepally,
Dylan Hutchison
Abstract:
Julia is a new language for writing data analysis programs that are easy to implement and run at high performance. Similarly, the Dynamic Distributed Dimensional Data Model (D4M) aims to clarify data analysis operations while retaining strong performance. D4M accomplishes these goals through a composable, unified data model on associative arrays. In this work, we present an implementation of D4M in Julia and describe how it enables and facilitates data analysis. Several experiments showcase scalable performance in our new Julia version as compared to the original Matlab implementation.
Submitted 13 August, 2016;
originally announced August 2016.
-
From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database
Authors:
Dylan Hutchison,
Jeremy Kepner,
Vijay Gadepally,
Bill Howe
Abstract:
Google BigTable's scale-out design for distributed key-value storage inspired a generation of NoSQL databases. Recently the NewSQL paradigm emerged in response to analytic workloads that demand distributed computation local to data storage. Many such analytics take the form of graph algorithms, a trend that motivated the GraphBLAS initiative to standardize a set of matrix math kernels for building graph algorithms. In this article we show how it is possible to implement the GraphBLAS kernels in a BigTable database by presenting the design of Graphulo, a library for executing graph algorithms inside the Apache Accumulo database. We detail the Graphulo implementation of two graph algorithms and conduct experiments comparing their performance to two main-memory matrix math systems. Our results shed insight into the conditions that determine when executing a graph algorithm is faster inside a database versus an external system---in short, that memory requirements and relative I/O are critical factors.
Submitted 11 August, 2016; v1 submitted 22 June, 2016;
originally announced June 2016.
-
Associative Array Model of SQL, NoSQL, and NewSQL Databases
Authors:
Jeremy Kepner,
Vijay Gadepally,
Dylan Hutchison,
Hayden Jananthan,
Timothy Mattson,
Siddharth Samsi,
Albert Reuther
Abstract:
The success of SQL, NoSQL, and NewSQL databases is a reflection of their ability to provide significant functionality and performance benefits for specific domains, such as financial transactions, internet search, and data analysis. The BigDAWG polystore seeks to provide a mechanism to allow applications to transparently achieve the benefits of diverse databases while insulating applications from the details of these databases. Associative arrays provide a common approach to the mathematics found in different databases: sets (SQL), graphs (NoSQL), and matrices (NewSQL). This work presents the SQL relational model in terms of associative arrays and identifies the key mathematical properties that are preserved within SQL. These properties include associativity, commutativity, distributivity, identities, annihilators, and inverses. Performance measurements on distributivity and associativity show the impact these properties can have on associative array operations. These results demonstrate that associative arrays could provide a mathematical model for polystores to optimize the exchange of data and execution queries.
Submitted 18 June, 2016;
originally announced June 2016.
-
Mathematical Foundations of the GraphBLAS
Authors:
Jeremy Kepner,
Peter Aaltonen,
David Bader,
Aydın Buluc,
Franz Franchetti,
John Gilbert,
Dylan Hutchison,
Manoj Kumar,
Andrew Lumsdaine,
Henning Meyerhenke,
Scott McMillan,
Jose Moreira,
John D. Owens,
Carl Yang,
Marcin Zalewski,
Timothy Mattson
Abstract:
The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. Mathematically, the GraphBLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix multiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of a small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.
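The abstract's remark that adjacency and incidence matrices are connected by matrix multiplication can be illustrated with a minimal sketch. It assumes a small directed graph and edge-by-vertex incidence matrices, here called `e_out` and `e_in`; the graph, the helper names, and the plain-list matrix representation are illustrative choices, not taken from the paper or from any GraphBLAS implementation.

```python
# Hypothetical 4-vertex directed graph with edges 0->1, 0->2, 1->2, 2->3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_vertices = 4


def incidence_matrices(edges, n):
    """Build edge-by-vertex incidence matrices (one row per edge)."""
    e_out = [[0] * n for _ in edges]  # e_out[k][v] = 1 if edge k leaves v
    e_in = [[0] * n for _ in edges]   # e_in[k][v] = 1 if edge k enters v
    for k, (src, dst) in enumerate(edges):
        e_out[k][src] = 1
        e_in[k][dst] = 1
    return e_out, e_in


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Dense matrix product over ordinary (+, *) arithmetic."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def adjacency_from_incidence(edges, n):
    """A = transpose(E_out) * E_in; A[i][j] counts edges from i to j."""
    e_out, e_in = incidence_matrices(edges, n)
    return matmul(transpose(e_out), e_in)


A = adjacency_from_incidence(edges, num_vertices)
for row in A:
    print(row)
```

Each entry `A[i][j]` sums, over all edges, the product "edge leaves i" times "edge enters j", so it counts the edges from vertex i to vertex j.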
Submitted 13 July, 2016; v1 submitted 18 June, 2016;
originally announced June 2016.
-
ModelWizard: Toward Interactive Model Construction
Authors:
Dylan Hutchison
Abstract:
Data scientists engage in model construction to discover machine learning models that well explain a dataset, in terms of predictiveness, understandability and generalization across domains. Questions such as "what if we model common cause Z" and "what if Y's dependence on X reverses" inspire many candidate models to consider and compare, yet current tools emphasize constructing a final model all at once.
To more naturally reflect exploration when debating numerous models, we propose an interactive model construction framework grounded in composable operations. Primitive operations capture core steps refining data and model that, when verified, form an inductive basis to prove model validity. Derived, composite operations enable advanced model families, both generic and specialized, abstracted away from low-level details.
We prototype our envisioned framework in ModelWizard, a domain-specific language embedded in F# to construct Tabular models. We enumerate language design and demonstrate its use through several applications, emphasizing how language may facilitate creation of complex models. To future engineers designing data science languages and tools, we offer ModelWizard's design as a new model construction paradigm, speeding discovery of our universe's structure.
Submitted 15 April, 2016;
originally announced April 2016.
-
Lara: A Key-Value Algebra underlying Arrays and Relations
Authors:
Dylan Hutchison,
Bill Howe,
Dan Suciu
Abstract:
Data processing systems roughly group into families such as relational, array, graph, and key-value. Many data processing tasks exceed the capabilities of any one family, require data stored across families, or run faster when partitioned onto multiple families. Discovering ways to execute computation among multiple available systems, let alone discovering an optimal execution plan, is challenging given semantic differences between disparate families of systems. In this paper we introduce a new algebra, Lara, which underlies and unifies algebras representing the families above in order to facilitate translation between systems. We describe the operations and objects of Lara---union, join, and ext on associative tables---and show her properties and equivalences to other algebras. Multi-system optimization has a bright future, in which we proffer Lara for the role of universal connector.
Submitted 12 April, 2016;
originally announced April 2016.
-
Perspectives on Software-Defined Networks: interviews with five leading scientists from the networking community
Authors:
Daniel M Batista,
Gordon Blair,
Fabio Kon,
Raouf Boutaba,
David Hutchison,
Raj Jain,
Ramachandran Ramjee,
Christian E Rothenberg
Abstract:
Software defined Networks (SDNs) have drawn much attention both from academia and industry over the last few years. Despite the fact that underlying ideas already exist through areas such as P2P applications and active networks (e.g. virtual topologies and dynamic changes of the network via software), only now has the technology evolved to a point where it is possible to scale the implementations, which justifies the high interest in SDNs nowadays. In this article, the JISA Editors invite five leading scientists from three continents (Raouf Boutaba, David Hutchison, Raj Jain, Ramachandran Ramjee, and Christian Esteve Rothenberg) to give their opinions about what is really new in SDNs. The interviews cover whether big telecom and data center companies need to consider using SDNs, if the new paradigm is changing the way computer networks are understood and taught, and what are the open issues on the topic.
Submitted 28 March, 2016;
originally announced March 2016.
-
Review and Analysis of Networking Challenges in Cloud Computing
Authors:
Jose Moura,
David Hutchison
Abstract:
Cloud Computing offers virtualized computing, storage, and networking resources, over the Internet, to organizations and individual users in a completely dynamic way. These cloud resources are cheaper, easier to manage, and more elastic than sets of local, physical, ones. This encourages customers to outsource their applications and services to the cloud. The migration of both data and applications outside the administrative domain of customers into a shared environment imposes transversal, functional problems across distinct platforms and technologies. This article provides a contemporary discussion of the most relevant functional problems associated with the current evolution of Cloud Computing, mainly from the network perspective. The paper also gives a concise description of Cloud Computing concepts and technologies. It starts with a brief history about cloud computing, tracing its roots. Then, architectural models of cloud services are described, and the most relevant products for Cloud Computing are briefly discussed along with a comprehensive literature review. The paper highlights and analyzes the most pertinent and practical network issues of relevance to the provision of high-assurance cloud services through the Internet, including security. Finally, trends and future research directions are also presented.
Submitted 23 January, 2016; v1 submitted 20 January, 2016;
originally announced January 2016.
-
Graphulo: Linear Algebra Graph Kernels for NoSQL Databases
Authors:
Vijay Gadepally,
Jake Bolewski,
Dan Hook,
Dylan Hutchison,
Ben Miller,
Jeremy Kepner
Abstract:
Big data and the Internet of Things era continue to challenge computational systems. Several technology solutions such as NoSQL databases have been developed to deal with this challenge. In order to generate meaningful results from large datasets, analysts often use a graph representation which provides an intuitive way to work with the data. Graph vertices can represent users and events, and edges can represent the relationship between vertices. Graph algorithms are used to extract meaningful information from these very large graphs. At MIT, the Graphulo initiative is an effort to perform graph algorithms directly in NoSQL databases such as Apache Accumulo or SciDB, which have an inherently sparse data storage scheme. Sparse matrix operations have a history of efficient implementations and the Graph Basic Linear Algebra Subprogram (GraphBLAS) community has developed a set of key kernels that can be used to develop efficient linear algebra operations. However, in order to use the GraphBLAS kernels, it is important that common graph algorithms be recast using the linear algebra building blocks. In this article, we look at common classes of graph algorithms and recast them into linear algebra operations using the GraphBLAS building blocks.
Submitted 5 October, 2015; v1 submitted 28 August, 2015;
originally announced August 2015.
-
Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database
Authors:
Dylan Hutchison,
Jeremy Kepner,
Vijay Gadepally,
Adam Fuchs
Abstract:
The Apache Accumulo database excels at distributed storage and indexing and is ideally suited for storing graph data. Many big data analytics compute on graph data and persist their results back to the database. These graph calculations are often best performed inside the database server. The GraphBLAS standard provides a compact and efficient basis for a wide range of graph applications through a small number of sparse matrix operations. In this article, we implement GraphBLAS sparse matrix multiplication server-side by leveraging Accumulo's native, high-performance iterators. We compare the mathematics and performance of inner and outer product implementations, and show how an outer product implementation achieves optimal performance near Accumulo's peak write rate. We offer our work as a core component to the Graphulo library that will deliver matrix math primitives for graph analytics within Accumulo.
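The difference between the inner and outer product formulations mentioned in the abstract can be sketched as follows. This is a generic illustration on sparse matrices stored as `{(row, col): value}` dictionaries, not Graphulo's iterator-based implementation; the function names and the sample matrices are assumptions made for the example.

```python
from collections import defaultdict


def matmul_inner(a, b, n_rows, n_inner, n_cols):
    """Inner product form: C[i][j] = dot(row i of A, column j of B)."""
    c = {}
    for i in range(n_rows):
        for j in range(n_cols):
            total = sum(a.get((i, k), 0) * b.get((k, j), 0)
                        for k in range(n_inner))
            if total:
                c[(i, j)] = total
    return c


def matmul_outer(a, b):
    """Outer product form: C = sum over k of (column k of A) x (row k of B).

    Each k yields a batch of partial products that are summed into C.
    """
    a_cols = defaultdict(list)  # k -> [(i, A[i][k]), ...]
    for (i, k), value in a.items():
        a_cols[k].append((i, value))
    b_rows = defaultdict(list)  # k -> [(j, B[k][j]), ...]
    for (k, j), value in b.items():
        b_rows[k].append((j, value))

    c = defaultdict(int)
    for k in a_cols.keys() & b_rows.keys():
        for i, a_val in a_cols[k]:
            for j, b_val in b_rows[k]:
                c[(i, j)] += a_val * b_val
    return {key: value for key, value in c.items() if value}


# Hypothetical 2x2 sparse matrices.
A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}

C_inner = matmul_inner(A, B, n_rows=2, n_inner=2, n_cols=2)
C_outer = matmul_outer(A, B)
print(C_inner == C_outer)
```

The inner product form visits every output cell and looks up a row and a column for each one, whereas the outer product form reads each input entry once per matching index k and only writes partial products that need to be summed.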
Submitted 30 August, 2015; v1 submitted 4 July, 2015;
originally announced July 2015.