-
Human-Centered Design Recommendations for LLM-as-a-Judge
Authors:
Qian Pan,
Zahra Ashktorab,
Michael Desmond,
Martin Santillan Cooper,
James Johnson,
Rahul Nair,
Elizabeth Daly,
Werner Geyer
Abstract:
Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure criteria used to evaluate are aligned with the human's intent, and evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM that enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria aligning the LLM-as-a-judge with practitioners' preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-judge systems.
Submitted 3 July, 2024;
originally announced July 2024.
-
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
Authors:
Yufang Hou,
Alessandra Pascale,
Javier Carnerero-Cano,
Tigran Tchrakian,
Radu Marinescu,
Elizabeth Daly,
Inkit Padhi,
Prasanna Sattigeri
Abstract:
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.
Submitted 19 June, 2024;
originally announced June 2024.
-
Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Authors:
Erik Miehling,
Manish Nagireddy,
Prasanna Sattigeri,
Elizabeth M. Daly,
David Piorkowski,
John T. Richards
Abstract:
Modern language models, while sophisticated, exhibit some inherent shortcomings, particularly in conversational settings. We claim that many of the observed shortcomings can be attributed to violation of one or more conversational principles. By drawing upon extensive research from both the social science and AI communities, we propose a set of maxims -- quantity, quality, relevance, manner, benevolence, and transparency -- for describing effective human-AI conversation. We first justify the applicability of the first four maxims (from Grice) in the context of human-AI interactions. We then argue that two new maxims, benevolence (concerning the generation of, and engagement with, harmful content) and transparency (concerning recognition of one's knowledge boundaries, operational constraints, and intents), are necessary for addressing behavior unique to modern human-AI interactions. We evaluate the degree to which various language models are able to understand these maxims and find that models possess an internal prioritization of principles that can significantly impact their ability to interpret the maxims accurately.
Submitted 22 June, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations
Authors:
Swapnaja Achintalwar,
Adriana Alvarado Garcia,
Ateret Anaby-Tavor,
Ioana Baldini,
Sara E. Berger,
Bishwaranjan Bhattacharjee,
Djallel Bouneffouf,
Subhajit Chaudhury,
Pin-Yu Chen,
Lamogha Chiazor,
Elizabeth M. Daly,
Kirushikesh DB,
Rogério Abreu de Paula,
Pierre Dognin,
Eitan Farchi,
Soumya Ghosh,
Michael Hind,
Raya Horesh,
George Kour,
Ja Young Lee,
Nishtha Madaan,
Sameep Mehta,
Erik Miehling,
Keerthiram Murugesan,
Manish Nagireddy
, et al. (13 additional authors not shown)
Abstract:
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
Submitted 13 June, 2024; v1 submitted 9 March, 2024;
originally announced March 2024.
-
Ranking Large Language Models without Ground Truth
Authors:
Amit Dhurandhar,
Rahul Nair,
Moninder Singh,
Elizabeth Daly,
Karthikeyan Natesan Ramamurthy
Abstract:
Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data. This points to a viable low-resource mechanism for practical use.
Submitted 10 June, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
Explaining Knock-on Effects of Bias Mitigation
Authors:
Svetoslav Nizhnichenkov,
Rahul Nair,
Elizabeth Daly,
Brian Mac Namee
Abstract:
In machine learning systems, bias mitigation approaches aim to make outcomes fairer across privileged and unprivileged groups. Bias mitigation methods work in different ways and have known "waterfall" effects, e.g., mitigating bias at one place may manifest bias elsewhere. In this paper, we aim to characterise impacted cohorts when mitigation interventions are applied. To do so, we treat intervention effects as a classification task and learn an explainable meta-classifier to identify cohorts that have altered outcomes. We examine a range of bias mitigation strategies that work at various stages of the model life cycle. We empirically demonstrate that our meta-classifier is able to uncover impacted cohorts. Further, we show that all tested mitigation strategies negatively impact a non-trivial fraction of cases, i.e., people who receive unfavourable outcomes solely on account of mitigation efforts. This is despite improvement in fairness metrics. We use these results as a basis to argue for more careful audits of static mitigation interventions that go beyond aggregate metrics.
Submitted 1 December, 2023;
originally announced December 2023.
-
Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification
Authors:
Jasmina Gajcin,
James McCarthy,
Rahul Nair,
Radu Marinescu,
Elizabeth Daly,
Ivana Dusparic
Abstract:
A well-defined reward function is crucial for successful training of a reinforcement learning (RL) agent. However, defining a suitable reward function is a notoriously challenging task, especially in complex, multi-objective environments. Developers often have to resort to starting with an initial, potentially misspecified reward function, and iteratively adjusting its parameters, based on observed learned behavior. In this work, we aim to automate this process by proposing ITERS, an iterative reward shaping approach using human feedback for mitigating the effects of a misspecified reward function. Our approach allows the user to provide trajectory-level feedback on the agent's behavior during training, which can be integrated as a reward shaping signal in the following training iteration. We also allow the user to provide explanations of their feedback, which are used to augment the feedback and reduce user effort and feedback frequency. We evaluate ITERS in three environments and show that it can successfully correct misspecified reward functions.
Submitted 30 August, 2023;
originally announced August 2023.
-
Interpretable Differencing of Machine Learning Models
Authors:
Swagatam Haldar,
Diptikalyan Saha,
Dennis Wei,
Rahul Nair,
Elizabeth M. Daly
Abstract:
Understanding the differences between machine learning (ML) models is of interest in scenarios ranging from choosing amongst a set of competing models, to updating a deployed model with new training data. In these cases, we wish to go beyond differences in overall metrics such as accuracy to identify where in the feature space the differences occur. We formalize this problem of model differencing as one of predicting a dissimilarity function of two ML models' outputs, subject to the representation of the differences being human-interpretable. Our solution is to learn a Joint Surrogate Tree (JST), which is composed of two conjoined decision tree surrogates for the two models. A JST provides an intuitive representation of differences and places the changes in the context of the models' decision logic. Context is important as it helps users to map differences to an underlying mental model of an AI system. We also propose a refinement procedure to increase the precision of a JST. We demonstrate, through an empirical evaluation, that such contextual differencing is concise and can be achieved with no loss in fidelity over naive approaches.
Submitted 13 June, 2023; v1 submitted 10 June, 2023;
originally announced June 2023.
-
AutoDOViz: Human-Centered Automation for Decision Optimization
Authors:
Daniel Karl I. Weidele,
Shazia Afzal,
Abel N. Valente,
Cole Makuch,
Owen Cornec,
Long Vu,
Dharmashankar Subramanian,
Werner Geyer,
Rahul Nair,
Inge Vejsbjerg,
Radu Marinescu,
Paulito Palmes,
Elizabeth M. Daly,
Loraine Franke,
Daniel Haehn
Abstract:
We present AutoDOViz, an interactive user interface for automated decision optimization (AutoDO) using reinforcement learning (RL). Decision optimization (DO) has classically been practiced by dedicated DO researchers, where experts need to spend long periods of time fine-tuning a solution through trial and error. AutoML pipeline search has sought to make it easier for a data scientist to find the best machine learning pipeline by leveraging automation to search and tune the solution. More recently, these advances have been applied to the domain of AutoDO, with a similar goal to find the best reinforcement learning pipeline through algorithm selection and parameter tuning. However, Decision Optimization requires significantly more complex problem specification when compared to an ML problem. AutoDOViz seeks to lower the barrier of entry for data scientists in problem specification for reinforcement learning problems, leverage the benefits of AutoDO algorithms for RL pipeline search and, finally, create visualizations and policy insights in order to facilitate the typical interactive nature when communicating problem formulation and solution proposals between DO experts and domain experts. In this paper, we report our findings from semi-structured expert interviews with DO practitioners as well as business consultants, leading to design requirements for human-centered automation for DO with RL. We evaluate a system implementation with data scientists and find that they are significantly more open to engage in DO after using our proposed solution. AutoDOViz further increases trust in RL agent models and makes the automated training and evaluation process more comprehensible. As shown for other automation in ML tasks, we also conclude that automation of RL for DO can benefit from the user, and vice versa, when the interface promotes human-in-the-loop.
Submitted 19 February, 2023;
originally announced February 2023.
-
Development of a prototype superconducting radio-frequency cavity for conduction-cooled accelerators
Authors:
G. Ciovati,
J. Anderson,
S. Balachandran,
G. Cheng,
B. Coriton,
E. Daly,
P. Dhakal,
A. Gurevich,
F. Hannon,
K. Harding,
L. Holland,
F. Marhauser,
K. McLaughlin,
D. Packard,
T. Powers,
U. Pudasaini,
J. Rathke,
R. Rimmer,
T. Schultheiss,
H. Vennekate,
D. Vollmer
Abstract:
The higher efficiency of superconducting radio-frequency (SRF) cavities compared to normal-conducting ones enables the development of high-energy continuous-wave linear accelerators (linacs). Recent progress in the development of high-quality Nb$_3$Sn film coatings along with the availability of cryocoolers with high cooling capacity at 4 K makes it feasible to operate SRF cavities cooled by thermal conduction at relevant accelerating gradients for use in accelerators. A possible use of conduction-cooled SRF linacs is for environmental applications, requiring electron beams with energy of $1 - 10$ MeV and 1 MW of power. We have designed a 915 MHz SRF linac for such an application and developed a prototype single-cell cavity to prove the proposed design by operating it with cryocoolers at the accelerating gradient required for 1 MeV energy gain. The cavity has a $\sim 3$ $μ$m thick Nb$_3$Sn film on the inner surface, deposited on a $\sim4$ mm thick bulk Nb substrate and a bulk $\sim7$ mm thick Cu outer shell with three Cu attachment tabs. The cavity was tested up to a peak surface magnetic field of 53 mT in liquid He at 4.3 K. A horizontal test cryostat was designed and built to test the cavity cooled with three Gifford-McMahon cryocoolers. The rf tests of the conduction-cooled cavity, performed at General Atomics, achieved a peak surface magnetic field of 50 mT and stable operation was possible with up to 18.5 W of rf heat load. The peak frequency shift due to microphonics was 23 Hz. These results represent the highest peak surface magnetic field achieved in a conduction-cooled SRF cavity to date and meet the requirements for a 1 MeV energy gain.
Submitted 22 March, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach
Authors:
Dennis Wei,
Rahul Nair,
Amit Dhurandhar,
Kush R. Varshney,
Elizabeth M. Daly,
Moninder Singh
Abstract:
Interpretable and explainable machine learning has seen a recent surge of interest. We focus on safety as a key motivation behind the surge and make the relationship between interpretability and safety more quantitative. Toward assessing safety, we introduce the concept of maximum deviation via an optimization problem to find the largest deviation of a supervised learning model from a reference model regarded as safe. We then show how interpretability facilitates this safety assessment. For models including decision trees, generalized linear and additive models, the maximum deviation can be computed exactly and efficiently. For tree ensembles, which are not regarded as interpretable, discrete optimization techniques can still provide informative bounds. For a broader class of piecewise Lipschitz functions, we leverage the multi-armed bandit literature to show that interpretability produces tighter (regret) bounds on the maximum deviation. We present case studies, including one on mortgage approval, to illustrate our methods and the insights about models that may be obtained from deviation maximization.
Submitted 2 November, 2022;
originally announced November 2022.
-
Leveraging Explanations in Interactive Machine Learning: An Overview
Authors:
Stefano Teso,
Öznur Alkan,
Wolfgang Stammer,
Elizabeth Daly
Abstract:
Explanations have gained an increasing level of interest in the AI and Machine Learning (ML) communities in order to improve model transparency and allow users to form a mental model of a trained ML model. However, explanations can go beyond this one-way communication as a mechanism to elicit user control, because once users understand, they can then provide feedback. The goal of this paper is to present an overview of research where explanations are combined with interactive capabilities as a means to learn new models from scratch and to edit and debug existing ones. To this end, we draw a conceptual map of the state-of-the-art, grouping relevant approaches based on their intended purpose and on how they structure the interaction, highlighting similarities and differences between them. We also discuss open research issues and outline possible directions forward, with the hope of spurring further research on this blooming research topic.
Submitted 9 October, 2022; v1 submitted 29 July, 2022;
originally announced July 2022.
-
Boolean Decision Rules for Reinforcement Learning Policy Summarisation
Authors:
James McCarthy,
Rahul Nair,
Elizabeth Daly,
Radu Marinescu,
Ivana Dusparic
Abstract:
Explainability of Reinforcement Learning (RL) policies remains a challenging research problem, particularly when considering RL in a safety context. Understanding the decisions and intentions of an RL policy offers avenues to incorporate safety into the policy by limiting undesirable actions. We propose the use of a Boolean Decision Rules model to create a post-hoc rule-based summary of an agent's policy. We evaluate our proposed approach using a DQN agent trained on an implementation of a lava gridworld and show that, given a hand-crafted feature representation of this gridworld, simple generalised rules can be created, giving a post-hoc explainable summary of the agent's policy. We discuss possible avenues to introduce safety into an RL agent's policy by using rules generated by this rule-based model as constraints imposed on the agent's policy, as well as discuss how creating simple rule summaries of an agent's policy may help in the debugging process of RL agents.
Submitted 18 July, 2022;
originally announced July 2022.
-
User Driven Model Adjustment via Boolean Rule Explanations
Authors:
Elizabeth M. Daly,
Massimiliano Mattetti,
Öznur Alkan,
Rahul Nair
Abstract:
AI solutions are heavily dependent on the quality and accuracy of the input training data; however, the training data may not always fully reflect the most up-to-date policy landscape or may be missing business logic. The advances in explainability have opened the possibility of allowing users to interact with interpretable explanations of ML predictions in order to inject modifications or constraints that more accurately reflect current realities of the system. In this paper, we present a solution which leverages the predictive power of ML models while allowing the user to specify modifications to decision boundaries. Our interactive overlay approach achieves this goal without requiring model retraining, making it appropriate for systems that need to apply instant changes to their decision making. We demonstrate that user feedback rules can be layered with the ML predictions to provide immediate changes, which in turn supports learning with less data.
Submitted 28 March, 2022;
originally announced March 2022.
-
FROTE: Feedback Rule-Driven Oversampling for Editing Models
Authors:
Öznur Alkan,
Dennis Wei,
Massimiliano Mattetti,
Rahul Nair,
Elizabeth M. Daly,
Diptikalyan Saha
Abstract:
Machine learning models may involve decision boundaries that change over time due to updates to rules and regulations, such as in loan approvals or claims management. However, in such scenarios, it may take time for sufficient training data to accumulate in order to retrain the model to reflect the new decision boundaries. While work has been done to reinforce existing decision boundaries, very little has been done to cover these scenarios where decision boundaries of the ML models should change in order to reflect new rules. In this paper, we focus on user-provided feedback rules as a way to expedite the ML models update process, and we formally introduce the problem of pre-processing training data to edit an ML model in response to feedback rules such that once the model is retrained on the pre-processed data, its decision boundaries align more closely with the rules. To solve this problem, we propose a novel data augmentation method, the Feedback Rule-Based Oversampling Technique. Extensive experiments using different ML models and real world datasets demonstrate the effectiveness of the method, in particular the benefit of augmentation and the ability to handle many feedback rules.
Submitted 6 January, 2022; v1 submitted 4 January, 2022;
originally announced January 2022.
-
Contrastive Explanations for Comparing Preferences of Reinforcement Learning Agents
Authors:
Jasmina Gajcin,
Rahul Nair,
Tejaswini Pedapati,
Radu Marinescu,
Elizabeth Daly,
Ivana Dusparic
Abstract:
In complex tasks where the reward function is not straightforward and consists of a set of objectives, multiple reinforcement learning (RL) policies that perform the task adequately but employ different strategies can be trained by adjusting the impact of individual objectives on the reward function. Understanding the differences in strategies between policies is necessary to enable users to choose between offered policies, and can help developers understand different behaviors that emerge from various reward functions and training hyperparameters in RL systems. In this work we compare the behavior of two policies trained on the same task, but with different preferences in objectives. We propose a method for distinguishing differences in behavior that stem from different abilities from those that are a consequence of opposing preferences of two RL agents. Furthermore, we use only data on preference-based differences in order to generate contrasting explanations about agents' preferences. Finally, we test and evaluate our approach on an autonomous driving task and compare the behavior of a safety-oriented policy and one that prefers speed.
Submitted 17 December, 2021;
originally announced December 2021.
-
Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification
Authors:
Simon Graham,
Mostafa Jahanifar,
Ayesha Azam,
Mohammed Nimir,
Yee-Wah Tsang,
Katherine Dodd,
Emily Hero,
Harvir Sahota,
Atisha Tank,
Ksenija Benes,
Noorul Wahab,
Fayyaz Minhas,
Shan E Ahmed Raza,
Hesham El Daly,
Kishore Gopalakrishnan,
David Snead,
Nasir Rajpoot
Abstract:
The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, there is a major bottleneck in the success of such approaches because supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in the field of CPath because the generation of detailed annotations usually demands the input of a pathologist to be able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. However, solely relying on automatic generation of annotations will limit the accuracy and reliability of ground truth. Therefore, to help overcome the above challenges, we propose a multi-stage annotation pipeline to enable the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We have released the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.
Submitted 29 November, 2021; v1 submitted 25 August, 2021;
originally announced August 2021.
-
Designing Machine Learning Pipeline Toolkit for AutoML Surrogate Modeling Optimization
Authors:
Paulito P. Palmes,
Akihiro Kishimoto,
Radu Marinescu,
Parikshit Ram,
Elizabeth Daly
Abstract:
The pipeline optimization problem in machine learning requires simultaneous optimization of pipeline structures and parameter adaptation of their elements. Having an elegant way to express these structures can help lessen the complexity in the management and analysis of their performances together with the different choices of optimization strategies. With these issues in mind, we created the AutoMLPipeline (AMLP) toolkit, which facilitates the creation and evaluation of complex machine learning pipeline structures using simple expressions. We use AMLP to find optimal pipeline signatures, datamine them, and use these datamined features to speed up learning and prediction. We formulated a two-stage pipeline optimization with surrogate modeling in AMLP which outperforms other AutoML approaches with a 4-hour time budget in less than 5 minutes of AMLP computation time.
Submitted 13 July, 2021; v1 submitted 2 July, 2021;
originally announced July 2021.
-
The Large Hadron-Electron Collider at the HL-LHC
Authors:
P. Agostini,
H. Aksakal,
S. Alekhin,
P. P. Allport,
N. Andari,
K. D. J. Andre,
D. Angal-Kalinin,
S. Antusch,
L. Aperio Bella,
L. Apolinario,
R. Apsimon,
A. Apyan,
G. Arduini,
V. Ari,
A. Armbruster,
N. Armesto,
B. Auchmann,
K. Aulenbacher,
G. Azuelos,
S. Backovic,
I. Bailey,
S. Bailey,
F. Balli,
S. Behera,
O. Behnke
, et al. (312 additional authors not shown)
Abstract:
The Large Hadron electron Collider (LHeC) is designed to move the field of deep inelastic scattering (DIS) to the energy and intensity frontier of particle physics. Exploiting energy recovery technology, it collides a novel, intense electron beam with a proton or ion beam from the High Luminosity--Large Hadron Collider (HL-LHC). The accelerator and interaction region are designed for concurrent electron-proton and proton-proton operation. This report represents an update of the Conceptual Design Report (CDR) of the LHeC, published in 2012. It comprises new results on parton structure of the proton and heavier nuclei, QCD dynamics, electroweak and top-quark physics. It is shown how the LHeC will open a new chapter of nuclear particle physics in extending the accessible kinematic range in lepton-nucleus scattering by several orders of magnitude. Due to enhanced luminosity, large energy and the cleanliness of the hadronic final states, the LHeC has a strong Higgs physics programme and its own discovery potential for new physics. Building on the 2012 CDR, the report represents a detailed updated design of the energy recovery electron linac (ERL) including new lattice, magnet, superconducting radio frequency technology and further components. Challenges of energy recovery are described and the lower energy, high current, 3-turn ERL facility, PERLE at Orsay, is presented which uses the LHeC characteristics serving as a development facility for the design and operation of the LHeC. An updated detector design is presented corresponding to the acceptance, resolution and calibration goals which arise from the Higgs and parton density function physics programmes. The paper also presents novel results on the Future Circular Collider in electron-hadron mode, FCC-eh, which utilises the same ERL technology to further extend the reach of DIS to even higher centre-of-mass energies.
Submitted 12 April, 2021; v1 submitted 28 July, 2020;
originally announced July 2020.
-
Client Network: An Interactive Model for Predicting New Clients
Authors:
Massimiliano Mattetti,
Akihiro Kishimoto,
Adi Botea,
Elizabeth Daly,
Inge Vejsbjerg,
Bei Chen,
Öznur Alkan
Abstract:
Understanding prospective clients becomes increasingly important as companies aim to enlarge their market bases. Traditional approaches typically treat each client in isolation, either studying its interactions or similarities with existing clients. We propose the Client Network, which considers the entire client ecosystem to predict the success of sale pitches for targeted clients by complex network analysis. It combines a novel ranking algorithm with data visualization and navigation. Based on historical interaction data between companies and clients, the Client Network leverages organizational connectivity to locate the optimal paths to prospective clients. The user interface supports exploring the client ecosystem and performing sales-essential tasks. Our experiments and user interviews demonstrate the effectiveness of the Client Network and its success in supporting sellers' day-to-day tasks.
Submitted 9 July, 2020;
originally announced July 2020.
-
User Profiling from Reviews for Accurate Time-Based Recommendations
Authors:
Oznur Alkan,
Elizabeth Daly
Abstract:
Recommender systems are a valuable way to engage users in a system, increase participation and show them resources they may not have found otherwise. One significant challenge is that user interests may change over time and certain items have an inherently temporal aspect. As a result, a recommender system should try to take into account the time-dependent user-item relationships. However, temporal aspects of a user profile may not always be explicitly available and so we may need to infer this information from available resources. Product reviews on sites such as Amazon represent a valuable data source to understand why someone bought an item and potentially who the item is for. This information can then be used to construct a dynamic user profile. In this paper, we demonstrate utilising reviews to extract temporal information to infer the age category preference of users, and leverage this feature to generate time-dependent recommendations. Given the predictable and yet shifting nature of age and time, we show that recommendations generated using this dynamic aspect lead to higher accuracy compared with state-of-the-art techniques. Mining temporally related content in reviews can enable the recommender to go beyond finding similar items or users to potentially predict a future need of a user.
Submitted 15 June, 2020;
originally announced June 2020.
-
IRF: Interactive Recommendation through Dialogue
Authors:
Oznur Alkan,
Massimiliano Mattetti,
Elizabeth M. Daly,
Adi Botea,
Inge Vejsbjerg
Abstract:
Recent research focuses beyond recommendation accuracy, towards human factors that influence the acceptance of recommendations, such as user satisfaction, trust, transparency and sense of control. We present a generic interactive recommender framework that can add interaction functionalities to non-interactive recommender systems. We take advantage of dialogue systems to interact with the user and we design a middleware layer to provide the interaction functions, such as providing explanations for the recommendations, managing users' preferences learnt from dialogue, preference elicitation and refining recommendations based on learnt preferences.
Submitted 3 October, 2019;
originally announced October 2019.
-
An Evaluation Framework for Interactive Recommender System
Authors:
Oznur Alkan,
Elizabeth M. Daly,
Adi Botea
Abstract:
Traditional recommender systems present a relatively static list of recommendations to a user where the feedback is typically limited to an accept/reject or a rating model. However, these simple modes of feedback may only provide limited insights as to why a user likes or dislikes an item and what aspects of the item the user has considered. Interactive recommender systems present an opportunity to engage the user in the process by allowing them to interact with the recommendations, provide feedback and impact the results in real-time. Evaluation of the impact of the user interaction typically requires an extensive user study, which is time-consuming and gives researchers limited opportunities to tune their solutions without having to conduct multiple rounds of user feedback. Additionally, user experience and design aspects can have a significant impact on the user feedback, which may result in not necessarily assessing the quality of some of the underlying algorithmic decisions in the overall solution. As a result, we present an evaluation framework which aims to simulate the users interacting with the recommender. We formulate metrics to evaluate the quality of the interactive recommenders which are outputted by the framework once simulation is completed. While simulation alone is not sufficient to evaluate a complete solution, the results can be useful to help researchers tune their solution before moving to the user study stage.
Submitted 16 April, 2019;
originally announced April 2019.
-
Energy Constraints for Climate Models from Hydrologic Partitioning
Authors:
Jun Yin,
Salvatore Calabrese,
Edoardo Daly,
Amilcare Porporato
Abstract:
It is well known that evaporative cooling of Earth's surface water reduces the amount of radiation that goes into sensible heat, namely the portion of radiation that produces higher temperatures. However, a rigorous use of long-term hydrologic measurements and the related theories of hydrologic partitioning have not yet been fully exploited to quantify these effects on climate. Here, we show that Budyko's curve, a well-known and efficient framework for water balance estimation, can be effectively utilized to partition the surface energy fluxes by expressing the long-term evaporative fraction as a function of the dryness index. The combination of this energy partitioning method with hydrological observations allows us to estimate the surface energy components at watershed and continental scales. Analyzing climate model outputs through this new lens reveals energy biases due to inaccurate parameterization of hydrological and atmospheric processes, offering insight into model parameterization and providing useful information for improving climate projections.
Submitted 27 February, 2019;
originally announced February 2019.
-
Generating Dialogue Agents via Automated Planning
Authors:
Adi Botea,
Christian Muise,
Shubham Agarwal,
Oznur Alkan,
Ondrej Bajgar,
Elizabeth Daly,
Akihiro Kishimoto,
Luis Lastras,
Radu Marinescu,
Josef Ondrej,
Pablo Pedemonte,
Miroslav Vodolan
Abstract:
Dialogue systems have many applications such as customer support or question answering. Typically they have been limited to shallow single-turn interactions. However, more advanced applications such as career coaching or planning a trip require a much more complex multi-turn dialogue. Current limitations of conversational systems have made it difficult to support applications that require personalization, customization and context-dependent interactions. We tackle this challenging problem by using domain-independent AI planning to automatically create dialogue plans, customized to guide a dialogue towards achieving a given goal. The input includes a library of atomic dialogue actions, an initial state of the dialogue, and a goal. Dialogue plans are plugged into a dialogue system capable of orchestrating their execution. Use cases demonstrate the viability of the approach. Our work on dialogue planning has been integrated into a product, and it is in the process of being deployed into another.
Submitted 2 February, 2019;
originally announced February 2019.
-
PERLE: Powerful Energy Recovery Linac for Experiments - Conceptual Design Report
Authors:
D. Angal-Kalinin,
G. Arduini,
B. Auchmann,
J. Bernauer,
A. Bogacz,
F. Bordry,
S. Bousson,
C. Bracco,
O. Brüning,
R. Calaga,
K. Cassou,
V. Chetvertkova,
E. Cormier,
E. Daly,
D. Douglas,
K. Dupraz,
B. Goddard,
J. Henry,
A. Hutton,
E. Jensen,
W. Kaabi,
M. Klein,
P. Kostka,
F. Marhauser,
A. Martens
, et al. (17 additional authors not shown)
Abstract:
A conceptual design is presented of a novel ERL facility for the development and application of the energy recovery technique to linear electron accelerators in the multi-turn, large current and large energy regime. The main characteristics of the powerful energy recovery linac experiment facility (PERLE) are derived from the design of the Large Hadron electron Collider, an electron beam upgrade under study for the LHC, for which it would be the key demonstrator. PERLE is thus projected as a facility to investigate efficient, high current (> 10 mA) ERL operation with three re-circulation passages through newly designed SCRF cavities, at 801.58 MHz frequency, and following deceleration over another three re-circulations. In its fully equipped configuration, PERLE provides an electron beam of approximately 1 GeV energy. A physics programme possibly associated with PERLE is sketched, consisting of high precision elastic electron-proton scattering experiments, as well as photo-nuclear reactions of unprecedented intensities with up to 30 MeV photon beam energy as may be obtained using Fabry-Perot cavities. The facility has further applications as a general technology test bed that can investigate and validate novel superconducting magnets (beam induced quench tests) and superconducting RF structures (structure tests with high current beams, beam loading and transients). Besides a chapter on operation aspects, the report contains detailed considerations on the choices for the SCRF structure, optics and lattice design, solutions for arc magnets, source and injector and on further essential components. A suitable configuration derived from the here presented design concept may next be moved forward to a technical design and possibly be built by an international collaboration which is being established.
Submitted 24 May, 2017;
originally announced May 2017.
-
Dynamic Probabilistic Network Based Human Action Recognition
Authors:
Anne Veenendaal,
Eddie Jones,
Zhao Gang,
Elliot Daly,
Sumalini Vartak,
Rahul Patwardhan
Abstract:
This paper examines the use of dynamic probabilistic networks (DPN) for human action recognition. The actions of lifting objects, walking in the room, sitting in the room and a neutral standing pose were used for testing the classification. The research used the dynamic interrelation between various regions of interest (ROI) on the human body (face, body, arms, legs) and the time-series-based events related to these ROIs. These dynamic links are then used to recognize the human behavioral aspects in the scene. First, a model is developed to identify the human activities in an indoor scene, and this model is dependent on the key features and interlinks between the various dynamic events using DPNs. The sub-ROIs are classified with DPN to associate the combined interlink with a specific human activity. The recognition accuracy under indoor (controlled) lighting conditions is compared with that under outdoor lighting conditions. The accuracy in outdoor scenes was lower than in the controlled environment.
Submitted 25 July, 2016;
originally announced October 2016.
-
Code Definition Analysis for Call Graph Generation
Authors:
Anne Veenendaal,
Elliot Daly,
Eddie Jones,
Zhao Gang,
Sumalini Vartak,
Rahul S Patwardhan
Abstract:
Enterprise-level software is implemented using a multi-layer architecture. These layers are often implemented using de-coupled solutions with millions of lines of code. Programmers often have to track and debug a function call from the user interface layer to the data access layer while troubleshooting an issue. They have to inspect the code based on search results or use design documents to construct the call graph. This process is time-consuming and laborious. The development environment tools are insufficient or confined to analyzing only the code in the loaded solution. This paper proposes a method to construct a call graph of the call across several layers of the code residing in different code bases to help programmers better understand the design and architecture of the software. The signatures of classes, methods, and properties were evaluated and then matched against the code files. A graph of matching functions was created. The recursive search stopped when there were no matches or the data layer code was detected. The method resulted in 78.26% accuracy when compared with manual search.
Submitted 26 July, 2016;
originally announced October 2016.
-
Transport of LCLS-II 1.3 GHz cryomodule to SLAC
Authors:
M. W. McGee,
T. Arkan,
T. Peterson,
Z. Tang,
S. Boo,
M. Carrasco,
E. Daly,
N. Huque
Abstract:
In a partnership with SLAC National Accelerator Laboratory (SLAC) and Jefferson Lab, Fermilab will assemble and test 17 of the 35 total 1.3 GHz cryomodules for the Linac Coherent Light Source II (LCLS-II) Project. These include a prototype built and delivered by each Lab. Another two 3.9 GHz cryomodules will be built, tested and transported by Fermilab to SLAC. Each assembly will be transported over-the-road from Fermilab or Jefferson Lab using specific routes to SLAC. The transport system consists of a base frame, isolation fixture and upper protective truss. The strongback cryomodule lifting fixture is described along with other supporting equipment used for both over-the-road transport and local (on-site) transport at Fermilab. Initially, analysis of fragile components and stability studies will be performed in order to assess the risk associated with over-the-road transport of a fully assembled cryomodule.
Submitted 30 June, 2016;
originally announced July 2016.
-
MEIC Design Summary
Authors:
S. Abeyratne,
D. Barber,
A. Bogacz,
P. Brindza,
Y. Cai,
A. Camsonne,
A. Castilla,
P. Chevtsov,
E. Daly,
Y. S. Derbenev,
D. Douglas,
V. Dudnikov,
R. Ent,
B. Erdelyi,
Y. Filatov,
D. Gaskell,
J. Grames,
J. Guo,
L. Harwood,
A. Hutton,
C. Hyde,
K. Jordan,
A. Kimber,
G. A. Krafft,
A. Kondratenko
, et al. (30 additional authors not shown)
Abstract:
This document summarizes the design of Jefferson Lab's electron-ion collider, MEIC, as of January 20, 2015, and describes the facility whose cost was estimated for the United States Department of Energy Nuclear Sciences Advisory Committee EIC cost review of January 26-28, 2015. In particular, each of the main technical systems within the collider is presented to the level of the best current information.
Submitted 29 April, 2015;
originally announced April 2015.
-
The Highly Miniaturised Radiation Monitor
Authors:
E. F. Mitchell,
H. M. Araújo,
E. Daly,
N. Guerrini,
S. Gunes-Lasnet,
D. Griffin,
A. Marshall,
A. Menicucci,
T. Morse,
O. Poyntz-Wright,
R. Turchetta,
S. Woodward
Abstract:
We present the design and preliminary calibration results of a novel highly miniaturised particle radiation monitor (HMRM) for spacecraft use. The HMRM device comprises a telescopic configuration of active pixel sensors enclosed in a titanium shield, with an estimated total mass of 52 g and volume of 15 cm$^3$. The monitor is intended to provide real-time dosimetry and identification of energetic charged particles in fluxes of up to 10$^8$ cm$^{-2}$ s$^{-1}$ (omnidirectional). Achieving this capability with such a small instrument could open new prospects for radiation detection in space.
Submitted 20 June, 2014; v1 submitted 15 January, 2014;
originally announced January 2014.
-
Stochastic Modeling of Soil Salinity
Authors:
S. Suweis,
A. Rinaldo,
S. E. A. T. M. Van der Zee,
E. Daly,
A. Maritan,
A. Porporato
Abstract:
A minimalist stochastic model of primary soil salinity is proposed, in which the rate of soil salinization is determined by the balance between dry and wet salt deposition and the intermittent leaching events caused by rainfall events. The long term probability density functions of salt mass and concentration are found by reducing the coupled soil moisture and salt mass balance equation to a single stochastic differential equation driven by multiplicative Poisson noise. The novel analytical solutions provide insight on the interplay of the main soil, plant and climate parameters responsible for long-term soil salinization. In particular, they show the existence of two distinct regimes, one where the mean salt mass remains nearly constant (or decreases) with increasing rainfall frequency, and another where mean salt content increases markedly with increasing rainfall frequency. As a result, relatively small reductions of rainfall in drier climates may entail dramatic shifts in long-term soil salinization trends, with significant consequences e.g. for climate change impacts on rain-fed agriculture.
Submitted 10 July, 2012;
originally announced July 2012.