-
"I Want It That Way": Enabling Interactive Decision Support Using Large Language Models and Constraint Programming
Authors:
Connor Lawless,
Jakob Schoeffer,
Lindy Le,
Kael Rowan,
Shilad Sen,
Cristina St. Hill,
**a Suh,
Bahareh Sarrafzadeh
Abstract:
A critical factor in the success of decision support systems is the accurate modeling of user preferences. Psychology research has demonstrated that users often develop their preferences during the elicitation process, highlighting the pivotal role of system-user interaction in develo** personalized systems. This paper introduces a novel approach, combining Large Language Models (LLMs) with Cons…
▽ More
A critical factor in the success of decision support systems is the accurate modeling of user preferences. Psychology research has demonstrated that users often develop their preferences during the elicitation process, highlighting the pivotal role of system-user interaction in develo** personalized systems. This paper introduces a novel approach, combining Large Language Models (LLMs) with Constraint Programming to facilitate interactive decision support. We study this hybrid framework through the lens of meeting scheduling, a time-consuming daily activity faced by a multitude of information workers. We conduct three studies to evaluate the novel framework, including a diary study (n=64) to characterize contextual scheduling preferences, a quantitative evaluation of the system's performance, and a user study (n=10) with a prototype system. Our work highlights the potential for a hybrid LLM and optimization approach for iterative preference elicitation and design considerations for building systems that support human-system collaborative decision-making processes.
△ Less
Submitted 12 February, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
NCL-SM: A Fully Annotated Dataset of Images from Human Skeletal Muscle Biopsies
Authors:
Atif Khan,
Conor Lawless,
Amy Vincent,
Charlotte Warren,
Valeria Di Leo,
Tiago Gomes,
A. Stephen McGough
Abstract:
Single cell analysis of human skeletal muscle (SM) tissue cross-sections is a fundamental tool for understanding many neuromuscular disorders. For this analysis to be reliable and reproducible, identification of individual fibres within microscopy images (segmentation) of SM tissue should be automatic and precise. Biomedical scientists in this field currently rely on custom tools and general machi…
▽ More
Single cell analysis of human skeletal muscle (SM) tissue cross-sections is a fundamental tool for understanding many neuromuscular disorders. For this analysis to be reliable and reproducible, identification of individual fibres within microscopy images (segmentation) of SM tissue should be automatic and precise. Biomedical scientists in this field currently rely on custom tools and general machine learning (ML) models, both followed by labour intensive and subjective manual interventions to fine-tune segmentation. We believe that fully automated, precise, reproducible segmentation is possible by training ML models. However, in this important biomedical domain, there are currently no good quality, publicly available annotated imaging datasets available for ML model training. In this paper we release NCL-SM: a high quality bioimaging dataset of 46 human SM tissue cross-sections from both healthy control subjects and from patients with genetically diagnosed muscle pathology. These images include $>$ 50k manually segmented muscle fibres (myofibres). In addition we also curated high quality myofibre segmentations, annotating reasons for rejecting low quality myofibres and low quality regions in SM tissue images, making these annotations completely ready for downstream analysis. This, we believe, will pave the way for development of a fully automatic pipeline that identifies individual myofibres within images of tissue sections and, in particular, also classifies individual myofibres that are fit for further analysis.
△ Less
Submitted 25 November, 2023;
originally announced November 2023.
-
Introducing NCL-SM: A Fully Annotated Dataset of Images from Human Skeletal Muscle Biopsies
Authors:
Atif Khan,
Conor Lawless,
Amy Vincent,
Charlotte Warren,
Valeria Di Leo,
Tiago Gomes,
A. Stephen McGough
Abstract:
Single cell analysis of skeletal muscle (SM) tissue is a fundamental tool for understanding many neuromuscular disorders. For this analysis to be reliable and reproducible, identification of individual fibres within microscopy images (segmentation) of SM tissue should be precise. There is currently no tool or pipeline that makes automatic and precise segmentation and curation of images of SM tissu…
▽ More
Single cell analysis of skeletal muscle (SM) tissue is a fundamental tool for understanding many neuromuscular disorders. For this analysis to be reliable and reproducible, identification of individual fibres within microscopy images (segmentation) of SM tissue should be precise. There is currently no tool or pipeline that makes automatic and precise segmentation and curation of images of SM tissue cross-sections possible. Biomedical scientists in this field rely on custom tools and general machine learning (ML) models, both followed by labour intensive and subjective manual interventions to get the segmentation right. We believe that automated, precise, reproducible segmentation is possible by training ML models. However, there are currently no good quality, publicly available annotated imaging datasets available for ML model training. In this paper we release NCL-SM: a high quality bioimaging dataset of 46 human tissue sections from healthy control subjects and from patients with genetically diagnosed muscle pathology. These images include $>$ 50k manually segmented muscle fibres (myofibres). In addition we also curated high quality myofibres and annotated reasons for rejecting low quality myofibres and regions in SM tissue images, making this data completely ready for downstream analysis. This, we believe, will pave the way for development of a fully automatic pipeline that identifies individual myofibres within images of tissue sections and, in particular, also classifies individual myofibres that are fit for further analysis.
△ Less
Submitted 18 November, 2023;
originally announced November 2023.
-
Asymptotics of approximate Bayesian computation when summary statistics converge at heterogeneous rates
Authors:
Caroline Lawless,
Christian P. Robert,
Judith Rousseau,
Robin J. Ryder
Abstract:
We consider the asymptotic properties of Approximate Bayesian Computation (ABC) for the realistic case of summary statistics with heterogeneous rates of convergence. We allow some statistics to converge faster than the ABC tolerance, other statistics to converge slower, and cover the case where some statistics do not converge at all. We give conditions for the ABC posterior to converge, and provid…
▽ More
We consider the asymptotic properties of Approximate Bayesian Computation (ABC) for the realistic case of summary statistics with heterogeneous rates of convergence. We allow some statistics to converge faster than the ABC tolerance, other statistics to converge slower, and cover the case where some statistics do not converge at all. We give conditions for the ABC posterior to converge, and provide an explicit representation of the shape of the ABC posterior distribution in our general setting; in particular, we show how the shape of the posterior depends on the number of slow statistics. We then quantify the gain brought by the local linear post-processing step.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
Fair Minimum Representation Clustering
Authors:
Connor Lawless,
Oktay Gunluk
Abstract:
Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to real-world constructs (e.g. electoral districts) whose benefit can only be attained by groups when they reach a minimum level of representation (e.g. 50\% to elect their desired candidate). This paper considers the problem of performing k-means clusteri…
▽ More
Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to real-world constructs (e.g. electoral districts) whose benefit can only be attained by groups when they reach a minimum level of representation (e.g. 50\% to elect their desired candidate). This paper considers the problem of performing k-means clustering while ensuring groups (e.g. demographic groups) have that minimum level of representation in a specified number of clusters. We show that the popular $k$-means algorithm, Lloyd's algorithm, can result in unfair outcomes where certain groups lack sufficient representation past the minimum threshold in a proportional number of clusters. We formulate the problem through a mixed-integer optimization framework and present a variant of Lloyd's algorithm, called MiniReL, that directly incorporates the fairness constraints. We show that incorporating the fairness criteria leads to a NP-Hard sub-problem within Lloyd's algorithm, but we provide computational approaches that make the problem tractable for even large datasets. Numerical results show that the approach is able to create fairer clusters with practically no increase in the k-means clustering cost across standard benchmark datasets.
△ Less
Submitted 8 February, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
A Note on Task-Aware Loss via Reweighing Prediction Loss by Decision-Regret
Authors:
Connor Lawless,
Angela Zhou
Abstract:
In this short technical note we propose a baseline for decision-aware learning for contextual linear optimization, which solves stochastic linear optimization when cost coefficients can be predicted based on context information. We propose a decision-aware version of predict-then-optimize. We reweigh the prediction error by the decision regret incurred by an (unweighted) pilot estimator of costs t…
▽ More
In this short technical note we propose a baseline for decision-aware learning for contextual linear optimization, which solves stochastic linear optimization when cost coefficients can be predicted based on context information. We propose a decision-aware version of predict-then-optimize. We reweigh the prediction error by the decision regret incurred by an (unweighted) pilot estimator of costs to obtain a decision-aware predictor, then optimize with cost predictions from the decision-aware predictor. This method can be motivated as a finite-difference, iterate-independent approximation of the gradients of previously proposed end-to-end learning algorithms; it is also consistent with previously suggested intuition for end-to-end learning. This baseline is computationally easy to implement with readily available reweighted prediction oracles and linear optimization, and can be implemented with convex optimization so long as the prediction error minimization is convex. Empirically, we demonstrate that this approach can lead to improvements over a "predict-then-optimize" framework for settings with misspecified models, and is competitive with other end-to-end approaches. Therefore, due to its simplicity and ease of use, we suggest it as a simple baseline for end-to-end and decision-aware learning.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Explainable Deep Learning to Profile Mitochondrial Disease Using High Dimensional Protein Expression Data
Authors:
Atif Khan,
Conor Lawless,
Amy E Vincent,
Satish Pilla,
Sushanth Ramesh,
A. Stephen McGough
Abstract:
Mitochondrial diseases are currently untreatable due to our limited understanding of their pathology. We study the expression of various mitochondrial proteins in skeletal myofibres (SM) in order to discover processes involved in mitochondrial pathology using Imaging Mass Cytometry (IMC). IMC produces high dimensional multichannel pseudo-images representing spatial variation in the expression of a…
▽ More
Mitochondrial diseases are currently untreatable due to our limited understanding of their pathology. We study the expression of various mitochondrial proteins in skeletal myofibres (SM) in order to discover processes involved in mitochondrial pathology using Imaging Mass Cytometry (IMC). IMC produces high dimensional multichannel pseudo-images representing spatial variation in the expression of a panel of proteins within a tissue, including subcellular variation. Statistical analysis of these images requires semi-automated annotation of thousands of SMs in IMC images of patient muscle biopsies. In this paper we investigate the use of deep learning (DL) on raw IMC data to analyse it without any manual pre-processing steps, statistical summaries or statistical models. For this we first train state-of-art computer vision DL models on all available image channels, both combined and individually. We observed better than expected accuracy for many of these models. We then apply state-of-the-art explainable techniques relevant to computer vision DL to find the basis of the predictions of these models. Some of the resulting visual explainable maps highlight features in the images that appear consistent with the latest hypotheses about mitochondrial disease progression within myofibres.
△ Less
Submitted 31 October, 2022;
originally announced October 2022.
-
Cluster Explanation via Polyhedral Descriptions
Authors:
Connor Lawless,
Oktay Gunluk
Abstract:
Clustering is an unsupervised learning problem that aims to partition unlabelled data points into groups with similar features. Traditional clustering algorithms provide limited insight into the groups they find as their main focus is accuracy and not the interpretability of the group assignments. This has spurred a recent line of work on explainable machine learning for clustering. In this paper…
▽ More
Clustering is an unsupervised learning problem that aims to partition unlabelled data points into groups with similar features. Traditional clustering algorithms provide limited insight into the groups they find as their main focus is accuracy and not the interpretability of the group assignments. This has spurred a recent line of work on explainable machine learning for clustering. In this paper we focus on the cluster description problem where, given a dataset and its partition into clusters, the task is to explain the clusters. We introduce a new approach to explain clusters by constructing polyhedra around each cluster while minimizing either the complexity of the resulting polyhedra or the number of features used in the description. We formulate the cluster description problem as an integer program and present a column generation approach to search over an exponential number of candidate half-spaces that can be used to build the polyhedra. To deal with large datasets, we introduce a novel grou** scheme that first forms smaller groups of data points and then builds the polyhedra around the grouped data, a strategy which out-performs simply sub-sampling data. Compared to state of the art cluster description algorithms, our approach is able to achieve competitive interpretability with improved description accuracy.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
Interpretable Clustering via Multi-Polytope Machines
Authors:
Connor Lawless,
Jayant Kalagnanam,
Lam M. Nguyen,
Dzung Phan,
Chandra Reddy
Abstract:
Clustering is a popular unsupervised learning tool often used to discover groups within a larger population such as customer segments, or patient subtypes. However, despite its use as a tool for subgroup discovery and description - few state-of-the-art algorithms provide any rationale or description behind the clusters found. We propose a novel approach for interpretable clustering that both clust…
▽ More
Clustering is a popular unsupervised learning tool often used to discover groups within a larger population such as customer segments, or patient subtypes. However, despite its use as a tool for subgroup discovery and description - few state-of-the-art algorithms provide any rationale or description behind the clusters found. We propose a novel approach for interpretable clustering that both clusters data points and constructs polytopes around the discovered clusters to explain them. Our framework allows for additional constraints on the polytopes - including ensuring that the hyperplanes constructing the polytope are axis-parallel or sparse with integer coefficients. We formulate the problem of constructing clusters via polytopes as a Mixed-Integer Non-Linear Program (MINLP). To solve our formulation we propose a two phase approach where we first initialize clusters and polytopes using alternating minimization, and then use coordinate descent to boost clustering performance. We benchmark our approach on a suite of synthetic and real world clustering problems, where our algorithm outperforms state of the art interpretable and non-interpretable clustering algorithms.
△ Less
Submitted 10 December, 2021;
originally announced December 2021.
-
Interpretable and Fair Boolean Rule Sets via Column Generation
Authors:
Connor Lawless,
Sanjeeb Dash,
Oktay Gunluk,
Dennis Wei
Abstract:
This paper considers the learning of Boolean rules in disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. We also consider the fairness setting and extend the formulation to include explicit constraints on two different measures of c…
▽ More
This paper considers the learning of Boolean rules in disjunctive normal form (DNF, OR-of-ANDs, equivalent to decision rule sets) as an interpretable model for classification. An integer program is formulated to optimally trade classification accuracy for rule simplicity. We also consider the fairness setting and extend the formulation to include explicit constraints on two different measures of classification parity: equality of opportunity and equalized odds. Column generation (CG) is used to efficiently search over an exponential number of candidate rules without the need for heuristic rule mining. To handle large data sets, we propose an approximate CG algorithm using randomization. Compared to three recently proposed alternatives, the CG algorithm dominates the accuracy-simplicity trade-off in 8 out of 16 data sets. When maximized for accuracy, CG is competitive with rule learners designed for this purpose, sometimes finding significantly simpler solutions that are no less accurate. Compared to other fair and interpretable classifiers, our method is able to find rule sets that meet stricter notions of fairness with a modest trade-off in accuracy.
△ Less
Submitted 18 September, 2023; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Fair Decision Rules for Binary Classification
Authors:
Connor Lawless,
Oktay Gunluk
Abstract:
In recent years, machine learning has begun automating decision making in fields as varied as college admissions, credit lending, and criminal sentencing. The socially sensitive nature of some of these applications together with increasing regulatory constraints has necessitated the need for algorithms that are both fair and interpretable. In this paper we consider the problem of building Boolean…
▽ More
In recent years, machine learning has begun automating decision making in fields as varied as college admissions, credit lending, and criminal sentencing. The socially sensitive nature of some of these applications together with increasing regulatory constraints has necessitated the need for algorithms that are both fair and interpretable. In this paper we consider the problem of building Boolean rule sets in disjunctive normal form (DNF), an interpretable model for binary classification, subject to fairness constraints. We formulate the problem as an integer program that maximizes classification accuracy with explicit constraints on two different measures of classification parity: equality of opportunity and equalized odds. Column generation framework, with a novel formulation, is used to efficiently search over exponentially many possible rules. When combined with faster heuristics, our method can deal with large data-sets. Compared to other fair and interpretable classifiers, our method is able to find rule sets that meet stricter notions of fairness with a modest trade-off in accuracy.
△ Less
Submitted 2 July, 2021;
originally announced July 2021.
-
A simple proof of Pitman-Yor's Chinese restaurant process from its stick-breaking representation
Authors:
Caroline Lawless,
Julyan Arbel
Abstract:
For a long time, the Dirichlet process has been the gold standard discrete random measure in Bayesian nonparametrics. The Pitman--Yor process provides a simple and mathematically tractable generalization, allowing for a very flexible control of the clustering behaviour. Two commonly used representations of the Pitman--Yor process are the stick-breaking process and the Chinese restaurant process. T…
▽ More
For a long time, the Dirichlet process has been the gold standard discrete random measure in Bayesian nonparametrics. The Pitman--Yor process provides a simple and mathematically tractable generalization, allowing for a very flexible control of the clustering behaviour. Two commonly used representations of the Pitman--Yor process are the stick-breaking process and the Chinese restaurant process. The former is a constructive representation of the process which turns out very handy for practical implementation, while the latter describes the partition distribution induced. However, the usual proof of the connection between them is indirect and involves measure theory. We provide here an elementary proof of Pitman--Yor's Chinese Restaurant process from its stick-breaking representation.
△ Less
Submitted 24 October, 2018; v1 submitted 15 October, 2018;
originally announced October 2018.
-
Bayesian hierarchical modelling for inferring genetic interactions in yeast
Authors:
Jonathan Heydari,
Conor Lawless,
David A. Lydall,
Darren J. Wilkinson
Abstract:
Quantitative Fitness Analysis (QFA) is a high-throughput experimental and computational methodology for measuring the growth of microbial populations. QFA screens can be used to compare the health of cell populations with and without a mutation in a query gene in order to infer genetic interaction strengths genome-wide, examining thousands of separate genotypes. We introduce Bayesian, hierarchical…
▽ More
Quantitative Fitness Analysis (QFA) is a high-throughput experimental and computational methodology for measuring the growth of microbial populations. QFA screens can be used to compare the health of cell populations with and without a mutation in a query gene in order to infer genetic interaction strengths genome-wide, examining thousands of separate genotypes. We introduce Bayesian, hierarchical models of population growth rates and genetic interactions that better reflect QFA experimental design than current approaches. Our new approach models population dynamics and genetic interaction simultaneously, thereby avoiding passing information between models via a univariate fitness summary. Matching experimental structure more closely, Bayesian hierarchical approaches use data more efficiently and find new evidence for genes which interact with yeast telomeres within a published dataset.
△ Less
Submitted 14 August, 2015;
originally announced August 2015.
-
Fast Bayesian parameter estimation for stochastic logistic growth models
Authors:
Jonathan Heydari,
Conor Lawless,
David A. Lydall,
Darren J. Wilkinson
Abstract:
The transition density of a stochastic, logistic population growth model with multiplicative intrinsic noise is analytically intractable. Inferring model parameter values by fitting such stochastic differential equation (SDE) models to data therefore requires relatively slow numerical simulation. Where such simulation is prohibitively slow, an alternative is to use model approximations which do ha…
▽ More
The transition density of a stochastic, logistic population growth model with multiplicative intrinsic noise is analytically intractable. Inferring model parameter values by fitting such stochastic differential equation (SDE) models to data therefore requires relatively slow numerical simulation. Where such simulation is prohibitively slow, an alternative is to use model approximations which do have an analytically tractable transition density, enabling fast inference. We introduce two such approximations, with either multiplicative or additive intrinsic noise, each derived from the linear noise approximation of the logistic growth SDE. After Bayesian inference we find that our fast LNA models, using Kalman filter recursion for computation of marginal likelihoods, give similar posterior distributions to slow arbitrarily exact models. We also demonstrate that simulations from our LNA models better describe the characteristics of the stochastic logistic growth models than a related approach. Finally, we demonstrate that our LNA model with additive intrinsic noise and measurement error best describes an example set of longitudinal observations of microbial population size taken from a typical, genome-wide screening experiment.
△ Less
Submitted 26 October, 2013; v1 submitted 21 October, 2013;
originally announced October 2013.