-
Effective Technical Reviews
Authors:
Scott Ballentine,
Eitan Farchi
Abstract:
There are two ways to check whether a program is correct: execute it or review it. While executing a program is the ultimate test of its correctness, reviewing the program can occur earlier in its development and, if done effectively, find problems. This work focuses on review techniques. It enables the programmer to effectively review a program and find a range of problems, from concurrency to interface issues. The review techniques can be applied in a time-constrained industrial development context and are enhanced by knowledge of programming pitfalls.
Submitted 2 July, 2024;
originally announced July 2024.
-
Using Combinatorial Optimization to Design a High quality LLM Solution
Authors:
Samuel Ackerman,
Eitan Farchi,
Rami Katan,
Orna Raz
Abstract:
We introduce a novel LLM-based solution design approach that utilizes combinatorial optimization and sampling. Specifically, a set of factors that influence the quality of the solution is identified. These typically include factors that represent prompt types, LLM input alternatives, and parameters governing the generation and design alternatives. Identifying the factors that govern the LLM solution quality enables the infusion of subject matter expert knowledge. Next, a set of interactions between the factors is defined, and combinatorial optimization is used to create a small subset $P$ that ensures all desired interactions occur in $P$. Each element $p \in P$ is then developed into an appropriate benchmark. Applying the alternative solutions to each combination $p \in P$ and evaluating the results facilitates the design of a high-quality LLM solution pipeline. The approach is especially applicable when the design and evaluation of each benchmark in $P$ is time-consuming and involves manual steps and human evaluation. Given its efficiency, the approach can also be used as a baseline to compare and validate an autoML approach that searches over the factors governing the solution.
Submitted 15 May, 2024;
originally announced May 2024.
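The subset construction described in the abstract above (a small set $P$ in which all desired factor interactions occur) can be illustrated with a greedy pairwise-covering sketch. This is only an illustration under stated assumptions: the abstract does not say which optimization method the authors use, the greedy heuristic is swapped in for simplicity, "desired interactions" is assumed to mean all pairs of factor values, and the factor names and values are hypothetical.

```python
from itertools import combinations, product

def pairwise_subset(factors):
    """Greedily pick configurations until every pair of factor values co-occurs.

    Assumption: 'desired interactions' means all two-way value combinations.
    """
    names = sorted(factors)
    # Every pair of (factor, value) assignments that must appear at least once.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in factors[a]
        for vb in factors[b]
    }
    # Full Cartesian product of configurations to choose from.
    candidates = [dict(zip(names, values))
                  for values in product(*(factors[n] for n in names))]

    def pairs_of(config):
        return {((a, config[a]), (b, config[b]))
                for a, b in combinations(names, 2)}

    chosen = []
    while uncovered:
        # Take the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen

# Hypothetical factors; not taken from the paper.
factors = {
    "prompt_type": ["zero_shot", "few_shot", "chain_of_thought"],
    "input_format": ["raw_text", "summary"],
    "temperature": [0.0, 0.7],
}
P = pairwise_subset(factors)
for p in P:
    print(p)
```

Each element of `P` would then be turned into a benchmark, so a smaller `P` directly reduces the manual evaluation effort the abstract mentions.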
-
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
Authors:
Swapnaja Achintalwar,
Ioana Baldini,
Djallel Bouneffouf,
Joan Byamugisha,
Maria Chang,
Pierre Dognin,
Eitan Farchi,
Ndivhuwo Makondo,
Aleksandra Mojsilovic,
Manish Nagireddy,
Karthikeyan Natesan Ramamurthy,
Inkit Padhi,
Orna Raz,
Jesus Rios,
Prasanna Sattigeri,
Moninder Singh,
Siphiwe Thwala,
Rosario A. Uceda-Sosa,
Kush R. Varshney
Abstract:
The alignment of large language models is usually done by model providers to add or control behaviors that are common or universally understood across use cases and contexts. In contrast, in this article, we present an approach and architecture that empowers application developers to tune a model to their particular values, social norms, laws and other regulations, and orchestrate between potentially conflicting requirements in context. We lay out three main components of such an Alignment Studio architecture: Framers, Instructors, and Auditors that work in concert to control the behavior of a language model. We illustrate this approach with a running example of aligning a company's internal-facing enterprise chatbot to its business conduct guidelines.
Submitted 8 March, 2024;
originally announced March 2024.
-
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
Authors:
Swapnaja Achintalwar,
Adriana Alvarado Garcia,
Ateret Anaby-Tavor,
Ioana Baldini,
Sara E. Berger,
Bishwaranjan Bhattacharjee,
Djallel Bouneffouf,
Subhajit Chaudhury,
Pin-Yu Chen,
Lamogha Chiazor,
Elizabeth M. Daly,
Kirushikesh DB,
Rogério Abreu de Paula,
Pierre Dognin,
Eitan Farchi,
Soumya Ghosh,
Michael Hind,
Raya Horesh,
George Kour,
Ja Young Lee,
Nishtha Madaan,
Sameep Mehta,
Erik Miehling,
Keerthiram Murugesan,
Manish Nagireddy
, et al. (13 additional authors not shown)
Abstract:
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
Submitted 13 June, 2024; v1 submitted 9 March, 2024;
originally announced March 2024.
-
Unveiling Safety Vulnerabilities of Large Language Models
Authors:
George Kour,
Marcel Zalmanovici,
Naama Zwerdling,
Esther Goldbraich,
Ora Nova Fandina,
Ateret Anaby-Tavor,
Orna Raz,
Eitan Farchi
Abstract:
As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
Submitted 7 November, 2023;
originally announced November 2023.
-
Predicting Question-Answering Performance of Large Language Models through Semantic Consistency
Authors:
Ella Rabinovich,
Samuel Ackerman,
Orna Raz,
Eitan Farchi,
Ateret Anaby-Tavor
Abstract:
Semantic consistency of a language model is broadly defined as the model's ability to produce semantically-equivalent outputs, given semantically-equivalent inputs. We address the task of assessing question-answering (QA) semantic consistency of contemporary large language models (LLMs) by manually creating a benchmark dataset with high-quality paraphrases for factual questions, and release the dataset to the community.
We further combine the semantic consistency metric with additional measurements suggested in prior work as correlating with LLM QA accuracy, to build and evaluate a framework for factual QA reference-less performance prediction -- predicting the likelihood that a language model accurately answers a question. Evaluating the framework on five contemporary LLMs, we demonstrate encouraging results that significantly outperform baselines.
Submitted 2 November, 2023;
originally announced November 2023.
-
Data Drift Monitoring for Log Anomaly Detection Pipelines
Authors:
Dipak Wani,
Samuel Ackerman,
Eitan Farchi,
Xiaotong Liu,
Hau-wen Chang,
Sarasi Lalithsena
Abstract:
Logs enable the monitoring of infrastructure status and the performance of associated applications. Logs are also invaluable for diagnosing the root causes of any problems that may arise. Log Anomaly Detection (LAD) pipelines automate the detection of anomalies in logs, providing assistance to site reliability engineers (SREs) in system diagnosis. Log patterns change over time, necessitating updates to the LAD model defining the `normal' log activity profile. In this paper, we introduce a Bayes Factor-based drift detection method that identifies when intervention, retraining, and updating of the LAD model are required with human involvement. We illustrate our method using sequences of log activity, both from unaltered data, and simulated activity with controlled levels of anomaly contamination, based on real collected log data.
Submitted 17 October, 2023;
originally announced October 2023.
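To make the Bayes Factor idea in the abstract above concrete, here is a minimal sketch of a Bayes Factor comparing "the rate changed" against "the rate is shared" for a single log pattern. The Beta-Binomial model, the uniform prior, and the function names are assumptions chosen for illustration; the abstract does not specify the likelihood model the authors use.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor_drift(k_base, n_base, k_new, n_new, a=1.0, b=1.0):
    """Bayes Factor of 'two separate rates' (drift) vs 'one shared rate'.

    k_* is how many of the n_* log lines in a window match some pattern.
    Assumed model: Binomial counts with a Beta(a, b) prior on each rate.
    Binomial coefficients cancel between the two hypotheses.
    """
    log_m_drift = (
        log_beta(a + k_base, b + n_base - k_base) - log_beta(a, b)
        + log_beta(a + k_new, b + n_new - k_new) - log_beta(a, b)
    )
    log_m_same = (
        log_beta(a + k_base + k_new, b + n_base + n_new - k_base - k_new)
        - log_beta(a, b)
    )
    return exp(log_m_drift - log_m_same)

# Hypothetical window counts.
stable = bayes_factor_drift(k_base=10, n_base=100, k_new=10, n_new=100)
shifted = bayes_factor_drift(k_base=10, n_base=100, k_new=60, n_new=100)
print(stable, shifted)
```

A value well above 1 favors drift and could trigger the human-involved retraining step; a value below 1 favors leaving the model as is. The threshold itself would be a design choice.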
-
Characterizing how 'distributional' NLP corpora distance metrics are
Authors:
Samuel Ackerman,
George Kour,
Eitan Farchi
Abstract:
A corpus of vector-embedded text documents has some empirical distribution. Given two corpora, we want to calculate a single metric of distance (e.g., Mauve, Frechet Inception) between them. We describe an abstract quality, called `distributionality', of such metrics. A non-distributional metric tends to use very local measurements, or uses global measurements in a way that does not fully reflect the distributions' true distance. For example, if individual pairwise nearest-neighbor distances are low, it may judge the two corpora to have low distance, even if their two distributions are in fact far from each other. A more distributional metric will, in contrast, better capture the distributions' overall distance. We quantify this quality by constructing a Known-Similarity Corpora set from two paraphrase corpora and calculating the distance between paired corpora from it. The distances' trend shape as set element separation increases should quantify the distributionality of the metric. We propose that Average Hausdorff Distance and energy distance between corpora are representative examples of non-distributional and distributional distance metrics, to which other metrics can be compared, to evaluate how distributional they are.
Submitted 23 October, 2023;
originally announced October 2023.
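The two reference metrics named in the abstract above can be sketched on small embedded corpora. This is a hedged illustration: the pairwise estimator of the energy statistic and the symmetric form of Average Hausdorff Distance are common textbook definitions assumed here, not necessarily the exact variants used in the paper, and the toy 2-D "embeddings" are hypothetical.

```python
from itertools import product
from math import dist

def mean_pairwise(xs, ys):
    """Average Euclidean distance over all cross pairs (x, y)."""
    return sum(dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def energy_distance(xs, ys):
    """Estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'| over all pairs.

    Uses every point of both corpora, so it reflects the whole distribution.
    """
    return (2 * mean_pairwise(xs, ys)
            - mean_pairwise(xs, xs)
            - mean_pairwise(ys, ys))

def average_hausdorff(xs, ys):
    """Mean nearest-neighbor distance, averaged over both directions.

    Only looks at each point's closest match, hence a 'local' measurement.
    """
    x_to_y = sum(min(dist(x, y) for y in ys) for x in xs) / len(xs)
    y_to_x = sum(min(dist(y, x) for x in xs) for y in ys) / len(ys)
    return (x_to_y + y_to_x) / 2

# Hypothetical 2-D document embeddings.
corpus_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
corpus_b = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
print(energy_distance(corpus_a, corpus_b), average_hausdorff(corpus_a, corpus_b))
```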
-
Automatic Generation of Attention Rules For Containment of Machine Learning Model Errors
Authors:
Samuel Ackerman,
Axel Bendavid,
Eitan Farchi,
Orna Raz
Abstract:
Machine learning (ML) solutions are prevalent in many applications. However, many challenges exist in making these solutions business-grade. One such challenge is maintaining the error rate of the underlying ML models at an acceptably low level. Typically, the true relationship between feature inputs and the target feature to be predicted is uncertain, and hence statistical in nature. The approach we propose is to separate the observations that are the most likely to be predicted incorrectly into `attention sets'. These can directly aid model diagnosis and improvement, and be used to decide on alternative courses of action for these problematic observations. We present several algorithms (`strategies') for determining optimal rules to separate these observations. In particular, we prefer strategies that use feature-based slicing because they are human-interpretable, model-agnostic, and require minimal supplementary inputs or knowledge. In addition, we show that these strategies outperform several common baselines, such as selecting observations with prediction confidence below a threshold. To evaluate strategies, we introduce metrics to measure various desired qualities, such as their performance, stability, and generalizability to unseen data; the strategies are evaluated on several publicly-available datasets. We use TOPSIS, a Multiple Criteria Decision Making method, to aggregate these metrics into a single quality score for each strategy, to allow comparison.
Submitted 14 May, 2023;
originally announced May 2023.
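As a rough illustration of the feature-based slicing mentioned in the abstract above, the sketch below scans single categorical features and returns the slice with the highest observed error rate. It is a deliberately simple stand-in, not one of the paper's strategies: the function name, the `min_support` parameter, the record layout (`label`, `pred`), and the data are all hypothetical.

```python
from collections import defaultdict

def best_attention_slice(rows, features, min_support=2):
    """Return (error_rate, feature, value) for the worst single-feature slice.

    Slices with fewer than min_support rows are ignored to avoid noisy rules.
    """
    best = None
    for feature in features:
        totals, errors = defaultdict(int), defaultdict(int)
        for row in rows:
            value = row[feature]
            totals[value] += 1
            errors[value] += int(row["pred"] != row["label"])
        for value, n in totals.items():
            rate = errors[value] / n
            if n >= min_support and (best is None or rate > best[0]):
                best = (rate, feature, value)
    return best

# Hypothetical labeled predictions.
rows = [
    {"region": "north", "device": "mobile", "label": 1, "pred": 1},
    {"region": "north", "device": "desktop", "label": 0, "pred": 0},
    {"region": "north", "device": "mobile", "label": 1, "pred": 0},
    {"region": "north", "device": "desktop", "label": 1, "pred": 1},
    {"region": "south", "device": "mobile", "label": 0, "pred": 1},
    {"region": "south", "device": "desktop", "label": 1, "pred": 0},
    {"region": "south", "device": "mobile", "label": 0, "pred": 0},
    {"region": "south", "device": "desktop", "label": 0, "pred": 1},
]
print(best_attention_slice(rows, ["region", "device"]))
```

The resulting rule (here, a single feature value) is human-readable and needs no access to the model internals, which is the property the abstract emphasizes.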
-
Convex Bounds on the Softmax Function with Applications to Robustness Verification
Authors:
Dennis Wei,
Haoze Wu,
Min Wu,
Pin-Yu Chen,
Clark Barrett,
Eitan Farchi
Abstract:
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well. This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models. We derive bounds using both a natural exponential-reciprocal decomposition of the softmax as well as an alternative decomposition in terms of the log-sum-exp function. The new bounds are provably and/or numerically tighter than linear bounds obtained in previous work on robustness verification of transformers. As illustrations of the utility of the bounds, we apply them to verification of transformers as well as of the robustness of predictive uncertainty estimates of deep ensembles.
Submitted 3 March, 2023;
originally announced March 2023.
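The two decompositions named in the abstract above are not written out there. The standard identities below are presumably what is meant; the notation ($x \in \mathbb{R}^K$, $\mathrm{LSE}$) is assumed for illustration.

```latex
\begin{align*}
\mathrm{softmax}(x)_i
  &= \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}
   = \frac{1}{\sum_{j=1}^{K} e^{x_j - x_i}}
   && \text{(exponential-reciprocal form)} \\
  &= \exp\bigl(x_i - \mathrm{LSE}(x)\bigr),
   \qquad \mathrm{LSE}(x) = \log \sum_{j=1}^{K} e^{x_j}
   && \text{(log-sum-exp form)}
\end{align*}
```

In the first form the softmax is a reciprocal of a sum of exponentials; in the second it is an exponential of a linear term minus the convex function $\mathrm{LSE}$, which is why each form lends itself to different bounding arguments.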
-
Quality Engineering for Agile and DevOps on the Cloud and Edge
Authors:
Eitan Farchi,
Saritha Route
Abstract:
Today's software projects include enhancements, fixes, and patches that need to be delivered to clients almost daily. Weekly and daily releases are pretty much the norm and sit alongside larger feature upgrades and quarterly releases. Software delivery has to be more agile now than ever before. Companies that were, in the past, experimenting with agile-based delivery models are now looking to scale them to enterprise grade. This shifts the need from the ability to build and execute tests rapidly to using different means, technologies and procedures to provide rapid and insightful validation sequences and tests that establish quality within the manufacturing cycle. This book addresses the need to effectively embed quality engineering throughout the agile development cycle, and thereby the need for enterprise-scale, high-quality agile development.
Submitted 16 February, 2024; v1 submitted 7 February, 2023;
originally announced February 2023.
-
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora
Authors:
George Kour,
Samuel Ackerman,
Orna Raz,
Eitan Farchi,
Boaz Carmeli,
Ateret Anaby-Tavor
Abstract:
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications. However, standard methods for evaluating these metrics have yet to be established. We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics, allowing sensible comparison of their behavior. We demonstrate the effectiveness of our evaluation measures in capturing fundamental characteristics by evaluating them on a collection of classical and state-of-the-art metrics. Our measures revealed that recently developed metrics are becoming better at identifying semantic distributional mismatch, while classical metrics are more sensitive to perturbations at the surface text level.
Submitted 29 November, 2022;
originally announced November 2022.
-
Random Test Generation of Application Programming Interfaces
Authors:
Eitan Farchi,
Krithika Prakash,
Vitali Sokhin
Abstract:
High-quality cloud API (Application Programming Interface) testing is essential for supporting the API economy. Autotest is a random test generator that addresses this need. It reads the API specification and deduces a model used in the test generation. This paper describes Autotest. It also addresses the topic of API specification pitfalls, which Autotest may reveal when reading the specification. A best practice is to add an appropriate test to the regression once a problem is revealed and solved. How to do that in the context of Autotest's random test generation is covered.
Submitted 6 November, 2022; v1 submitted 26 July, 2022;
originally announced July 2022.
-
Using Fuzzy Matching of Queries to optimize Database workloads
Authors:
Sweta Singh,
Vaibhav Kulkarni,
Mario Briggs,
Deepak Mahajan,
Eitan Farchi
Abstract:
Directed Acyclic Graphs (DAGs) are commonly used in Databases and Big Data computational engines like Apache Spark for representing the execution plan of queries. We refer to such graphs as Query Directed Acyclic Graphs (QDAGs). This paper uses similarity hashing to arrive at a fingerprint such that the fingerprint embodies the compute requirements of the query for QDAGs. The fingerprint, thus obtained, can be used to predict the runtime behaviour of a query based on queries executed in the past having similar QDAGs. We discuss two approaches to arrive at a fingerprint, their pros and cons and how aspects of both approaches can be combined to improve the predictions. Using a hybrid approach, we demonstrate that we are able to predict runtime behaviour of a QDAG with more than 80% accuracy.
Submitted 14 July, 2022;
originally announced July 2022.
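The abstract above does not say which similarity hash is used to fingerprint a QDAG. As one possible illustration, the sketch below applies SimHash to a bag of operator tokens taken from a plan; the choice of SimHash, the tokenization, and the operator names are all assumptions, and the graph structure of the plan is ignored here.

```python
import hashlib

def simhash(tokens, bits=64):
    """SimHash fingerprint of a bag of tokens (order-insensitive)."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when the majority of tokens voted for it.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Hypothetical operator tokens extracted from two query plans.
plan_a = ["Scan:orders", "Filter", "HashJoin", "Scan:customers", "Aggregate"]
plan_b = ["Scan:orders", "Filter", "HashJoin", "Scan:customers", "Sort"]
fp_a, fp_b = simhash(plan_a), simhash(plan_b)
print(hamming(fp_a, fp_b))
```

A small Hamming distance between fingerprints would then be used to look up previously executed queries with similar plans and reuse their observed runtime behaviour as a prediction.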
-
High-quality Conversational Systems
Authors:
Samuel Ackerman,
Ateret Anaby-Tavor,
Eitan Farchi,
Esther Goldbraich,
George Kour,
Ella Rabinovich,
Orna Raz,
Saritha Route,
Marcel Zalmanovici,
Naama Zwerdling
Abstract:
Conversational systems or chatbots are an example of AI-Infused Applications (AIIA). Chatbots are especially important as they are often the first interaction of clients with a business and are the entry point of a business into the AI (Artificial Intelligence) world. The quality of the chatbot is, therefore, key. However, as is the case in general with AIIAs, it is especially challenging to assess and control the quality of chatbot systems. Beyond the inherent statistical nature of these systems, where occasional failure is acceptable, we identify two major challenges. The first is to release an initial system that is of sufficient quality such that humans will interact with it. The second is to maintain the quality, enhance its capabilities, improve it and make necessary adjustments based on changing user requests or drift. These challenges exist because it is impossible to predict the real distribution of user requests and the natural language they will use to express these requests. Moreover, any empirical distribution of requests is likely to change over time. This may be due to periodicity, changing usage, and drift of topics.
We provide a methodology and set of technologies to address these challenges and to provide automated assistance through a human-in-the-loop approach. We note that it is crucial to connect the different phases in the lifecycle of chatbot development and to make sure the chatbot provides its expected business value, for example, that it frees human agents to deal with tasks other than answering human users. Our methodology and technologies apply during chatbot training in the pre-production phase, through to chatbot usage in the field in the post-production phase. They implement the `test first' paradigm by assisting in agile design, and support continuous integration through actionable insights.
Submitted 28 April, 2022; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Generalized Coverage Criteria for Combinatorial Sequence Testing
Authors:
Achiya Elyasaf,
Eitan Farchi,
Oded Margalit,
Gera Weiss,
Yeshayahu Weiss
Abstract:
We present a new model-based approach for testing systems that use sequences of actions and assertions as test vectors. Our solution includes a method for quantifying testing quality, a tool for generating high-quality test suites based on the coverage criteria we propose, and a framework for assessing risks. For testing quality, we propose a method that specifies generalized coverage criteria over sequences of actions, which extends previous approaches. Our publicly available tool demonstrates how to extract effective test suites from test plans based on these criteria. We also present a Bayesian approach for measuring the probabilities of bugs or risks, and show how this quantification can help achieve an informed balance between exploitation and exploration in testing. Finally, we provide an empirical evaluation demonstrating the effectiveness of our tool in finding bugs, assessing risks, and achieving coverage.
Submitted 31 October, 2023; v1 submitted 3 January, 2022;
originally announced January 2022.
-
Theory and Practice of Quality Assurance for Machine Learning Systems An Experiment Driven Approach
Authors:
Samuel Ackerman,
Guy Barash,
Eitan Farchi,
Orna Raz,
Onn Shehory
Abstract:
The crafting of machine learning (ML) based systems requires statistical control throughout their life cycle. Careful quantification of business requirements and identification of key factors that impact the business requirements reduces the risk of a project failure. The quantification of business requirements results in the definition of random variables representing the system key performance indicators that need to be analyzed through statistical experiments. In addition, available data for training and experiment results impact the design of the system. Once the system is developed, it is tested and continually monitored to ensure it meets its business requirements. This is done through the continued application of statistical experiments to analyze and control the key performance indicators. This book teaches the art of crafting and developing ML based systems. It advocates an "experiment first" approach stressing the need to define statistical experiments from the beginning of the project life cycle. It also discusses in detail how to apply statistical control on the ML based system throughout its lifecycle.
Submitted 12 April, 2022; v1 submitted 2 January, 2022;
originally announced January 2022.
-
Using sequential drift detection to test the API economy
Authors:
Samuel Ackerman,
Parijat Dube,
Eitan Farchi
Abstract:
The API economy refers to the widespread integration of API (application programming interface) microservices, where software applications can communicate with each other, as a crucial element in business models and functions. The number of possible ways in which such a system could be used is huge. It is thus desirable to monitor the usage patterns and identify when the system is used in a way that was never used before. This provides a warning to the system analysts, who can then ensure uninterrupted operation of the system.
In this work we analyze both histograms and call graphs of API usage to determine if the usage patterns of the system have shifted. We compare the application of nonparametric statistical and Bayesian sequential analysis to the problem. This is done in a way that overcomes the issue of repeated statistical tests and ensures statistical significance of the alerts. The technique was simulated and tested and proven effective in detecting the drift in various scenarios. We also mention modifications to the technique to decrease its memory so that it can respond more quickly when the distribution drift occurs at a delay from when monitoring begins.
Submitted 25 November, 2021; v1 submitted 9 November, 2021;
originally announced November 2021.
-
Detecting model drift using polynomial relations
Authors:
Eliran Roffe,
Samuel Ackerman,
Orna Raz,
Eitan Farchi
Abstract:
Machine learning models serve critical functions, such as classifying loan applicants as good or bad risks. Each model is trained under the assumption that the data used in training and in the field come from the same underlying unknown distribution. Often, this assumption is broken in practice. It is desirable to identify when this occurs, to minimize the impact on model performance.
We suggest a new approach to detecting change in the data distribution by identifying polynomial relations between the data features. We measure the strength of each identified relation using its R-square value. A strong polynomial relation captures a significant trait of the data which should remain stable if the data distribution does not change. We thus use a set of learned strong polynomial relations to identify drift. For each polynomial relation that is stronger than a given threshold, we calculate the amount of drift observed for that relation. The amount of drift is measured by calculating the Bayes Factor for the polynomial relation likelihood of the baseline data versus field data. We empirically validate the approach by simulating a range of changes, and identify drift using the Bayes Factor of the polynomial relation likelihood change.
Submitted 22 December, 2021; v1 submitted 24 October, 2021;
originally announced October 2021.
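The first half of the approach in the abstract above (learn a polynomial relation on baseline data and measure its strength with R-square) can be sketched as follows. This is a simplified illustration under stated assumptions: the relation is restricted to the hypothetical form y = a*x^2 + b, the data are synthetic, and a drop in R-square on field data is used as a crude stand-in for the Bayes Factor computation the paper describes.

```python
def fit_quadratic_relation(xs, ys):
    """Least-squares fit of y = a * x**2 + b; returns (a, b)."""
    zs = [x * x for x in xs]  # regress y on z = x**2
    n = len(zs)
    mean_z, mean_y = sum(zs) / n, sum(ys) / n
    cov = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
    var = sum((z - mean_z) ** 2 for z in zs)
    a = cov / var
    return a, mean_y - a * mean_z

def r_squared(xs, ys, a, b):
    """R-square of the fixed relation y = a * x**2 + b on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic baseline: the relation holds exactly.
xs = [float(i) for i in range(10)]
baseline_ys = [2 * x * x + 1 for x in xs]
a, b = fit_quadratic_relation(xs, baseline_ys)

# Synthetic field data: the two features are now related linearly instead.
field_ys = [3 * x + 5 for x in xs]

r2_baseline = r_squared(xs, baseline_ys, a, b)
r2_field = r_squared(xs, field_ys, a, b)
print(r2_baseline, r2_field)
```

Only relations whose baseline R-square exceeds a chosen threshold would be kept for monitoring; how far the field value must fall before raising an alert is a separate design choice.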
-
Density-based interpretable hypercube region partitioning for mixed numeric and categorical data
Authors:
Samuel Ackerman,
Eitan Farchi,
Orna Raz,
Marcel Zalmanovici,
Maya Zohar
Abstract:
Consider a structured dataset of features, such as $\{\textrm{SEX}, \textrm{INCOME}, \textrm{RACE}, \textrm{EXPERIENCE}\}$. A user may want to know where in the feature space observations are concentrated, and where it is sparse or empty. The existence of large sparse or empty regions can provide domain knowledge of soft or hard feature constraints (e.g., what is the typical income range, or that it may be unlikely to have a high income with few years of work experience). Also, these can suggest to the user that machine learning (ML) model predictions for data inputs in sparse or empty regions may be unreliable.
An interpretable region is a hyper-rectangle, such as $\{\textrm{RACE} \in\{\textrm{Black}, \textrm{White}\}\}\:\&$ $\{10 \leq \:\textrm{EXPERIENCE} \:\leq 13\}$, containing all observations satisfying the constraints; typically, such regions are defined by a small number of features. Our method constructs an observation density-based partition of the observed feature space in the dataset into such regions. It has a number of advantages over others in that it works on features of mixed type (numeric or categorical) in the original domain, and can separate out empty regions as well.
As can be seen from visualizations, the resulting partitions accord with spatial groupings that a human eye might identify; the results should thus extend to higher dimensions. We also show some applications of the partition to other data analysis tasks, such as inferring about ML model error, measuring high-dimensional density variability, and causal inference for treatment effect. Many of these applications are made possible by the hyper-rectangular form of the partition regions.
Submitted 8 November, 2021; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Towards API Testing Across Cloud and Edge
Authors:
Samuel Ackerman,
Sanjib Choudhury,
Nirmit Desai,
Eitan Farchi,
Dan Gisolfi,
Andrew Hicks,
Saritha Route,
Diptikalyan Saha
Abstract:
API economy is driving the digital transformation of business applications across the hybrid Cloud and edge environments. For such transformations to succeed, end-to-end testing of the application API composition is required. Testing of API compositions, even in centralized Cloud environments, is challenging as it requires coverage of functional as well as reliability requirements. The combinatorial space of scenarios is huge, e.g., API input parameters, order of API execution, and network faults. Hybrid Cloud and edge environments exacerbate the challenge of API testing due to the need to coordinate test execution across dynamic wide-area networks, possibly across network boundaries. To handle this challenge, we envision a test framework named Distributed Software Test Kit (DSTK). The DSTK leverages Combinatorial Test Design (CTD) to cover the functional requirements and then automatically covers the reliability requirements via under-the-hood closed loop between test execution feedback and AI based search algorithms. In each iteration of the closed loop, the search algorithms generate more reliability test scenarios to be executed next. Specifically, five kinds of reliability tests are envisioned: out-of-order execution of APIs, network delays and faults, API performance and throughput, changes in API call graph patterns, and changes in application topology.
Submitted 6 September, 2021;
originally announced September 2021.
-
Machine Learning Model Drift Detection Via Weak Data Slices
Authors:
Samuel Ackerman,
Parijat Dube,
Eitan Farchi,
Orna Raz,
Marcel Zalmanovici
Abstract:
Detecting drift in performance of Machine Learning (ML) models is an acknowledged challenge. For ML models to become an integral part of business applications it is essential to detect when an ML model drifts away from acceptable operation. However, it is often the case that actual labels are difficult and expensive to get, for example, because they require expert judgment. Therefore, there is a need for methods that detect likely degradation in ML operation without labels. We propose a method that utilizes feature space rules, called data slices, for drift detection. We provide experimental indications that our method is likely to identify when an ML model's performance will change, based on changes in the underlying data.
Submitted 11 August, 2021;
originally announced August 2021.
-
Broadly Applicable Targeted Data Sample Omission Attacks
Authors:
Guy Barash,
Eitan Farchi,
Sarit Kraus,
Onn Shehory
Abstract:
We introduce a novel clean-label targeted poisoning attack on learning mechanisms. While classical poisoning attacks typically corrupt data via addition, modification and omission, our attack focuses on data omission only. Our attack misclassifies a single, targeted test sample of choice, without manipulating that sample. We demonstrate the effectiveness of omission attacks against a large variety of learners including deep neural networks, SVM and decision trees, using several datasets including MNIST, IMDB and CIFAR. The focus of our attack on data omission only is beneficial as well, as it is simpler to implement and analyze. We show that, with a low attack budget, our attack's success rate is above 80%, and in some cases 100%, for white-box learning. It is systematically above the reference benchmark for black-box learning. For both white-box and black-box cases, changes in model accuracy are negligible, regardless of the specific learner and dataset. We also prove theoretically in a simplified agnostic PAC learning framework that, subject to dataset size and distribution, our omission attack succeeds with high probability against any successful simplified agnostic PAC learner.
Submitted 5 May, 2021; v1 submitted 4 May, 2021;
originally announced May 2021.
-
Detection of data drift and outliers affecting machine learning model performance over time
Authors:
Samuel Ackerman,
Eitan Farchi,
Orna Raz,
Marcel Zalmanovici,
Parijat Dube
Abstract:
A trained ML model is deployed on another `test' dataset where target feature values (labels) are unknown. Drift is distribution change between the training and deployment data, which is concerning if model performance changes. For a cat/dog image classifier, for instance, drift during deployment could be rabbit images (new class) or cat/dog images with changed characteristics (change in distribution). We wish to detect these changes but can't measure accuracy without deployment data labels. We instead detect drift indirectly by nonparametrically testing the distribution of model prediction confidence for changes. This generalizes our method and sidesteps domain-specific feature representation.
We address important statistical issues, particularly Type-1 error control in sequential testing, using Change Point Models (CPMs; see Adams and Ross 2012). We also use nonparametric outlier methods to show the user suspicious observations for model diagnosis, since the before/after change confidence distributions overlap significantly. In experiments to demonstrate robustness, we train on a subset of MNIST digit classes, then insert drift (e.g., unseen digit class) in deployment data in various settings (gradual/sudden changes in the drift proportion). A novel loss function is introduced to compare the performance (detection delay, Type-1 and 2 errors) of a drift detector under different levels of drift class contamination.
Submitted 6 September, 2022; v1 submitted 16 December, 2020;
originally announced December 2020.
-
A Game Theoretic Model for Strategic Coopetition in Business Networks
Authors:
Segev Wasserkrug,
Eitan Farchi
Abstract:
Private blockchain is driving the creation of business networks, resulting in the creation of new value or new business models for the enterprises participating in the network. Such business networks form when enterprises come together to derive value through a network which is greater than the value that can be derived solely by any single company. This results in a setting that combines both competitive and cooperative behavior, and which we call strategic coopetition. Traditionally, cooperative and competitive behavior have been analyzed separately in game theory. In this article, we provide a formal model that enables joint analysis of these different types of behavior and the interdependencies between them. Using this model, we formally demonstrate and analyze the incentives for both cooperative and competitive behavior.
Submitted 26 November, 2020;
originally announced November 2020.
-
Sequential Drift Detection in Deep Learning Classifiers
Authors:
Samuel Ackerman,
Parijat Dube,
Eitan Farchi
Abstract:
We utilize neural network embeddings to detect data drift by formulating the drift detection within an appropriate sequential decision framework. This enables control of the false alarm rate although the statistical tests are repeatedly applied. Since change detection algorithms naturally face a tradeoff between avoiding false alarms and quick correct detection, we introduce a loss function which evaluates an algorithm's ability to balance these two concerns, and we use it in a series of experiments.
Submitted 31 July, 2020;
originally announced July 2020.
-
Engineering Reliable Deep Learning Systems
Authors:
P. Santhanam,
Eitan Farchi,
Victor Pankratius
Abstract:
Recent progress in artificial intelligence (AI) using deep learning techniques has triggered its wide-scale use across a broad range of applications. These systems can already perform tasks such as natural language processing of voice and text, visual recognition, question-answering, recommendations and decision support. However, at the current level of maturity, the use of an AI component in mission-critical or safety-critical applications can have unexpected consequences. Consequently, serious concerns about reliability, repeatability, trust, and maintainability of AI applications remain. As AI becomes pervasive despite its shortcomings, more systematic ways of approaching AI software development and certification are needed. These fundamental aspects establish the need for a discipline on "AI Engineering". This paper presents the current perspective of relevant AI engineering concepts and some key challenges that need to be overcome to make significant progress in this important area.
Submitted 14 October, 2019;
originally announced October 2019.
-
Defending via strategic ML selection
Authors:
Eitan Farchi,
Onn Shehory,
Guy Barash
Abstract:
The results of a learning process depend on the input data. There are cases in which an adversary can strategically tamper with the input data to affect the outcome of the learning process. While some datasets are difficult to attack, many others are susceptible to manipulation. A resourceful attacker can tamper with large portions of the dataset and affect them. An attacker can additionally strategically focus on a preferred subset of the attributes in the dataset to maximize the effectiveness of the attack and minimize the resources allocated to data manipulation. In light of this vulnerability, we introduce a solution according to which the defender implements an array of learners, and their activation is performed strategically. The defender computes the (game theoretic) strategy space and accordingly applies a dominant strategy where possible, and a Nash-stable strategy otherwise. In this paper we provide the details of this approach. We analyze Nash equilibrium in such a strategic learning environment, and demonstrate our solution by specific examples.
Submitted 16 January, 2019;
originally announced April 2019.
-
Towards a Human-Centred Approach in Modelling and Testing of Cyber-Physical Systems
Authors:
Maria Spichkova,
Anna Zamansky,
Eitan Farchi
Abstract:
The ability to capture different levels of abstraction in a system model is especially important for remote integration, testing/verification, and manufacturing of cyber-physical systems (CPSs). However, the complexity of modelling and testing of CPSs makes these processes extremely prone to human error. In this paper we present our ongoing work on introducing human-centred considerations into modelling and testing of CPSs, which allow for agile iterative refinement processes of different levels of abstraction when errors are discovered or missing information is completed.
Submitted 22 January, 2016;
originally announced January 2016.
-
Teaching Logic to Information Systems Students: Challenges and Opportunities
Authors:
Anna Zamansky,
Eitan Farchi
Abstract:
In contrast to Computer Science, where the fundamental role of Logic is widely recognized, it plays a practically non-existent role in Information Systems curricula. In this paper we argue that instead of Logic's exclusion from the IS curriculum, a significant adaptation of the contents, as well as teaching methodologies, is required for an alignment with the needs of IS practitioners. We present our vision for such adaptation and report on concrete steps towards its implementation in the design and teaching of a course for graduate IS students at the University of Haifa. We discuss the course plan and present some data on the students' feedback on the course.
Submitted 13 July, 2015;
originally announced July 2015.