-
Reusable Formal Verification of DAG-based Consensus Protocols
Authors:
Nathalie Bertrand,
Pranav Ghorpade,
Sasha Rubin,
Bernhard Scholz,
Pavle Subotic
Abstract:
DAG-based consensus protocols are being adoption by blockchain companies to decrease energy footprints and improve security. A DAG-based consensus protocol collaboratively constructs a partial order of blocks of transactions and produces linearly ordered blocks. The ubiquity and strategic importance of blockchains call for formal proof of the correctness of key components, namely, consensus protoc…
▽ More
DAG-based consensus protocols are being adoption by blockchain companies to decrease energy footprints and improve security. A DAG-based consensus protocol collaboratively constructs a partial order of blocks of transactions and produces linearly ordered blocks. The ubiquity and strategic importance of blockchains call for formal proof of the correctness of key components, namely, consensus protocols. This paper presents a safety-proven formal specification of two DAG-based protocols. Our specification highlights several dissemination, DAG construction, and ordering variations that can be combined to express the two protocols. The formalization requires a refinement approach for modeling the consensus. In an abstract model, we first show the safety of DAG-based consensus on leader blocks and then further refine the specification to encompass all blocks for all processes. The TLA+ specification for a given protocol consists of 492-732 lines, and the proof system TLAPS verifies 2025-2294 obligations in 6-8 minutes.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
A Data Source Dependency Analysis Framework for Large Scale Data Science Projects
Authors:
Laurent Boué,
Pratap Kunireddy,
Pavle Subotić
Abstract:
Dependency hell is a well-known pain point in the development of large software projects and machine learning (ML) code bases are not immune from it. In fact, ML applications suffer from an additional form, namely, "data source dependency hell". This term refers to the central role played by data and its unique quirks that often lead to unexpected failures of ML models which cannot be explained by…
▽ More
Dependency hell is a well-known pain point in the development of large software projects and machine learning (ML) code bases are not immune from it. In fact, ML applications suffer from an additional form, namely, "data source dependency hell". This term refers to the central role played by data and its unique quirks that often lead to unexpected failures of ML models which cannot be explained by code changes. In this paper, we present an automated dependency map** framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast paced engineering environment and thus mitigate ahead of time the consequences of any data source changes (e.g., re-train model, ignore data, set default data etc.). Our system is based on a unified and generic approach, employing techniques from static analysis, from which data sources can be identified reliably for any type of dependency on a wide range of source languages and artefacts. The dependency map** framework is exposed as a REST web API where the only input is the path to the Git repository hosting the code base. Currently used by MLOps engineers at Microsoft, we expect such dependency map APIs to be adopted more widely by MLOps engineers in the future.
△ Less
Submitted 15 December, 2022;
originally announced December 2022.
-
Abstract Interpretation-Based Data Leakage Static Analysis
Authors:
Filip Drobnjaković,
Pavle Subotić,
Caterina Urban
Abstract:
Data leakage is a well-known problem in machine learning. Data leakage occurs when information from outside the training dataset is used to create a model. This phenomenon renders a model excessively optimistic or even useless in the real world since the model tends to leverage greatly on the unfairly acquired information. To date, detection of data leakages occurs post-mortem using run-time metho…
▽ More
Data leakage is a well-known problem in machine learning. Data leakage occurs when information from outside the training dataset is used to create a model. This phenomenon renders a model excessively optimistic or even useless in the real world since the model tends to leverage greatly on the unfairly acquired information. To date, detection of data leakages occurs post-mortem using run-time methods. However, due to the insidious nature of data leakage, it may not be apparent to a data scientist that a data leakage has occurred in the first place. For this reason, it is advantageous to detect data leakages as early as possible in the development life cycle. In this paper, we propose a novel static analysis to detect several instances of data leakages during development time. We define our analysis using the framework of abstract interpretation: we define a concrete semantics that is sound and complete, from which we derive a sound and computable abstract semantics. We implement our static analysis inside the open-source NBLyzer static analysis framework and demonstrate its utility by evaluating its performance and precision on over 2000 Kaggle competition notebooks.
△ Less
Submitted 29 November, 2022;
originally announced November 2022.
-
Scalable Typestate Analysis for Low-Latency Environments
Authors:
Alen Arslanagić,
Pavle Subotić,
Jorge A. Pérez
Abstract:
Static analyses based on typestates are important in certifying correctness of code contracts. Such analyses rely on Deterministic Finite Automata (DFAs) to specify properties of an object. We target the analysis of contracts in low-latency environments, where many useful contracts are impractical to codify as DFAs and/or the size of their associated DFAs leads to sub-par performance. To address t…
▽ More
Static analyses based on typestates are important in certifying correctness of code contracts. Such analyses rely on Deterministic Finite Automata (DFAs) to specify properties of an object. We target the analysis of contracts in low-latency environments, where many useful contracts are impractical to codify as DFAs and/or the size of their associated DFAs leads to sub-par performance. To address this bottleneck, we present a lightweight typestate analyzer, based on an expressive specification language that can succinctly specify code contracts. By implementing it in the static analyzer Infer, we demonstrate considerable performance and usability benefits when compared to existing techniques. A central insight is to rely on a sub-class of DFAs with efficient bit-vector operations.
△ Less
Submitted 18 July, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
A Static Analysis Framework for Data Science Notebooks
Authors:
Pavle Subotić,
Lazar Milikić,
Milan Stojić
Abstract:
Notebooks provide an interactive environment for programmers to develop code, analyse data and inject interleaved visualizations in a single environment. Despite their flexibility, a major pitfall that data scientists encounter is unexpected behaviour caused by the unique out-of-order execution model of notebooks. As a result, data scientists face various challenges ranging from notebook correctne…
▽ More
Notebooks provide an interactive environment for programmers to develop code, analyse data and inject interleaved visualizations in a single environment. Despite their flexibility, a major pitfall that data scientists encounter is unexpected behaviour caused by the unique out-of-order execution model of notebooks. As a result, data scientists face various challenges ranging from notebook correctness, reproducibility and cleaning. In this paper, we propose a framework that performs static analysis on notebooks, incorporating their unique execution semantics. Our framework is general in the sense that it accommodate for a wide range of analyses, useful for various notebook use cases. We have instantiated our framework on a diverse set of analyses and have evaluated them on 2211 real world notebooks. Our evaluation demonstrates that the vast majority (98.7%) of notebooks can be analysed in less than a second, well within the time frame required by interactive notebook clients
△ Less
Submitted 25 October, 2021; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Provenance for Large-scale Datalog
Authors:
David Zhao,
Pavle Subotic,
Bernhard Scholz
Abstract:
Logic programming languages such as Datalog have become popular as Domain Specific Languages (DSLs) for solving large-scale, real-world problems, in particular, static program analysis and network analysis. The logic specifications which model analysis problems, process millions of tuples of data and contain hundreds of highly recursive rules. As a result, they are notoriously difficult to debug.…
▽ More
Logic programming languages such as Datalog have become popular as Domain Specific Languages (DSLs) for solving large-scale, real-world problems, in particular, static program analysis and network analysis. The logic specifications which model analysis problems, process millions of tuples of data and contain hundreds of highly recursive rules. As a result, they are notoriously difficult to debug. While the database community has proposed several data-provenance techniques that address the Declarative Debugging Challenge for Databases, in the cases of analysis problems, these state-of-the-art techniques do not scale.
In this paper, we introduce a novel bottom-up Datalog evaluation strategy for debugging: our provenance evaluation strategy relies on a new provenance lattice that includes proof annotations, and a new fixed-point semantics for semi-naive evaluation. A debugging query mechanism allows arbitrary provenance queries, constructing partial proof trees of tuples with minimal height. We integrate our technique into Souffle, a Datalog engine that synthesizes C++ code, and achieve high performance by using specialized parallel data structures. Experiments are conducted with DOOP/DaCapo, producing proof annotations for tens of millions of output tuples. We show that our method has a runtime overhead of 1.27x on average while being more flexible than existing state-of-the-art techniques.
△ Less
Submitted 11 July, 2019;
originally announced July 2019.
-
Optimal On The Fly Index Selection in Polynomial Time
Authors:
Herbert Jordan,
Bernhard Scholz,
Pavle Subotić
Abstract:
The index selection problem (ISP) is an important problem for accelerating the execution of relational queries, and it has received a lot of attention as a combinatorial knapsack problem in the past. Various solutions to this very hard problem have been provided. In contrast to existing literature, we change the underlying assumptions of the problem definition: we adapt the problem for systems tha…
▽ More
The index selection problem (ISP) is an important problem for accelerating the execution of relational queries, and it has received a lot of attention as a combinatorial knapsack problem in the past. Various solutions to this very hard problem have been provided. In contrast to existing literature, we change the underlying assumptions of the problem definition: we adapt the problem for systems that store relations in memory, and use complex specification languages, e.g., Datalog. In our framework, we decompose complex queries into primitive searches that select tuples in a relation for which an equality predicate holds. A primitive search can be accelerated by an index exhibiting a worst-case run-time complexity of log-linear time in the size of the output result of the primitive search. However, the overheads associated with maintaining indexes are very costly in terms of memory and computing time.
In this work, we present an optimal polynomial-time algorithm that finds the minimal set of indexes of a relation for a given set of primitive searches. An index may cover more than one primitive search due to the algebraic properties of the search predicate, which is a conjunction of equalities over the attributes of a relation. The index search space exhibits a exponential complexity in the number of attributes in a relation, and, hence brute-force algorithms searching for solutions in the index domain are infeasible. As a scaffolding for designing a polynomial-time algorithm, we build a partial order on search operations and use a constructive version of Dilworth's theorem. We show a strong relationship between chains of primitive searches (forming a partial order) and indexes. We demonstrate the effectiveness and efficiency of our algorithm for an in-memory Datalog compiler that is able to process relations with billions of entries in memory.
△ Less
Submitted 12 September, 2017;
originally announced September 2017.
-
Horn Clauses for Communicating Timed Systems
Authors:
Hossein Hojjat,
Philipp Rümmer,
Pavle Subotic,
Wang Yi
Abstract:
Languages based on the theory of timed automata are a well established approach for modelling and analysing real-time systems, with many applications both in industrial and academic context. Model checking for timed automata has been studied extensively during the last two decades; however, even now industrial-grade model checkers are available only for few timed automata dialects (in particular U…
▽ More
Languages based on the theory of timed automata are a well established approach for modelling and analysing real-time systems, with many applications both in industrial and academic context. Model checking for timed automata has been studied extensively during the last two decades; however, even now industrial-grade model checkers are available only for few timed automata dialects (in particular Uppaal timed automata), exhibit limited scalability for systems with large discrete state space, or cannot handle parametrised systems. We explore the use of Horn constraints and off-the-shelf model checkers for analysis of networks of timed automata. The resulting analysis method is fully symbolic and applicable to systems with large or infinite discrete state space, and can be extended to include various language features, for instance Uppaal-style communication/broadcast channels and BIP-style interactions, and systems with infinite parallelism. Experiments demonstrate the feasibility of the method.
△ Less
Submitted 2 December, 2014;
originally announced December 2014.