Search | arXiv e-print repository

arXiv:2406.19314 [pdf, other]

LiveBench: A Challenging, Contamination-Free LLM Benchmark

Authors: Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, Micah Goldblum

Abstract: Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.08391 [pdf, other]

Large Language Models Must Be Taught to Know What They Don't Know

Authors: Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, Andrew Gordon Wilson

Abstract: When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Code available at: https://github.com/activatedgeek/calibration-tuning

arXiv:2402.18213 [pdf, other]

Multi-objective Differentiable Neural Architecture Search

Authors: Rhea Sanjay Sukthanker, Arber Zela, Benedikt Staffler, Samuel Dooley, Josif Grabocka, Frank Hutter

Abstract: Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a computationally expensive search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without extra costs, our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k, an encoder-decoder transformer space for machine translation and a decoder-only transformer space for language modelling.

Submitted 19 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: 37 pages, 27 figures
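
Illustrative note: the abstract above turns on the idea of a Pareto front, the set of candidates that no other candidate beats on every objective at once. The following is a minimal sketch of that definition only, assuming two objectives that are both minimized; the candidate values (error rate, latency in ms) and the function names are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error rate, latency in ms) pairs for five architectures.
candidates = [(0.10, 30.0), (0.12, 20.0), (0.11, 35.0), (0.20, 10.0), (0.25, 12.0)]
front = pareto_front(candidates)
print(front)
```

Here (0.11, 35.0) is dropped because (0.10, 30.0) is better on both objectives, and (0.25, 12.0) is dropped because (0.20, 10.0) is; the three remaining points are mutually incomparable trade-offs.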

arXiv:2402.13228 [pdf, other]

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Authors: Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

Abstract: Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.

Submitted 3 July, 2024; v1 submitted 20 February, 2024; originally announced February 2024.
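
Illustrative note: the failure mode named in the abstract above can be seen from the form of the standard DPO objective, which depends only on the margin between the preferred and dispreferred log-probability ratios. The sketch below is a per-example version of that standard loss, not the DPOP loss; the function name, the log-probability values, and beta = 0.1 are hypothetical choices made for illustration.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard per-example DPO loss: -log(sigmoid(beta * margin)).

    logp_w / logp_l are the policy log-probabilities of the preferred and
    dispreferred completions; ref_* are the same under the reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))


# Hypothetical numbers: start with the policy equal to the reference model.
before = dpo_loss(logp_w=-10.0, logp_l=-10.0, ref_logp_w=-10.0, ref_logp_l=-10.0)

# The preferred completion becomes LESS likely (-10 -> -12), but the
# dispreferred one drops further (-10 -> -20), so the margin still grows.
after = dpo_loss(logp_w=-12.0, logp_l=-20.0, ref_logp_w=-10.0, ref_logp_l=-10.0)

print(before, after)
```

In this toy case the loss goes down even though the preferred completion's log-probability also goes down, because only the gap between the two completions enters the objective.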

arXiv:2311.01933 [pdf, other]

ForecastPFN: Synthetically-Trained Zero-Shot Forecasting

Authors: Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha Naidu, Colin White

Abstract: The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very few initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called 'zero-shot' forecasting), its performance is inconsistent depending on the data used for pretraining. In this work, we take a different approach and devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. ForecastPFN is a prior-data fitted network, trained to approximate Bayesian inference, which can make predictions on a new time series dataset in a single forward pass. Through extensive experiments, we show that zero-shot predictions made by ForecastPFN are more accurate and faster compared to state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.

Submitted 3 November, 2023; originally announced November 2023.

Journal ref: Thirty-seventh Conference on Neural Information Processing Systems, 2023

arXiv:2310.12145 [pdf, other]

Fairer and More Accurate Tabular Models Through NAS

Authors: Richeek Das, Samuel Dooley

Abstract: Making models algorithmically fairer in tabular data has long been studied, with techniques typically oriented towards fixes which usually take a neural model with an undesirable outcome and make changes to how the data are ingested, what the model weights are, or how outputs are processed. We employ an emergent and different strategy where we consider updating the model's architecture and training hyperparameters to find an entirely new model with better outcomes from the beginning of the debiasing procedure. In this work, we propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) in the first application to the very challenging domain of tabular data. We conduct extensive exploration of architectural and hyperparameter spaces (MLP, ResNet, and FT-Transformer) across diverse datasets, demonstrating the dependence of accuracy and fairness metrics of model predictions on hyperparameter combinations. We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns. We propose a novel approach that jointly optimizes architectural and training hyperparameters in a multi-objective constraint of both accuracy and fairness. We produce architectures that consistently Pareto dominate state-of-the-art bias mitigation methods either in fairness, accuracy or both, all of this while being Pareto-optimal over hyperparameters achieved through single-objective (accuracy) optimization runs. This research underscores the promise of automating fairness and accuracy optimization in deep learning models.

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10628 [pdf, other]

Data Contamination Through the Lens of Time

Authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley

Abstract: Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2308.10882 [pdf, other]

Giraffe: Adventures in Expanding Context Lengths in LLMs

Authors: Arka Pal, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, Siddartha Naidu

Abstract: Modern large language models (LLMs) that rely on attention mechanisms are typically trained with fixed context lengths which enforce upper limits on the length of input sequences that they can handle at evaluation time. To use these models on sequences longer than the train-time context length, one might employ techniques from the growing family of context length extrapolation methods -- most of which focus on modifying the system of positional encodings used in the attention mechanism to indicate where tokens or activations are located in the input sequence. We conduct a wide survey of existing methods of context length extrapolation on a base LLaMA or LLaMA 2 model, and introduce some of our own design as well -- in particular, a new truncation strategy for modifying the basis for the position encoding. We test these methods using three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity, which we find to be less fine-grained as a measure of long context performance of LLMs. We release the three tasks publicly as datasets on HuggingFace. We discover that linear scaling is the best method for extending context length, and show that further gains can be achieved by using longer scales at evaluation time. We also discover promising extrapolation capabilities in the truncated basis. To support further research in this area, we release three new 13B parameter long-context models which we call Giraffe: 4k and 16k context models trained from base LLaMA-13B, and a 32k context model trained from base LLaMA2-13B. We also release the code to replicate our results.

Submitted 21 August, 2023; originally announced August 2023.
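
Illustrative note: the abstract above singles out linear scaling of positional encodings. The sketch below shows one common reading of that idea, assuming rotary position embeddings (as used in LLaMA) and assuming that "linear scaling" means dividing each position index by a constant factor before computing the rotation angles. The function name, the base of 10000, the tiny dimension, and the scale of 4 are illustrative assumptions, not details confirmed by the listing.

```python
from typing import List


def rope_angles(position: int, dim: int,
                base: float = 10000.0, scale: float = 1.0) -> List[float]:
    """Rotation angles for one token position under rotary embeddings.

    With scale > 1 the position index is compressed by that factor, so a
    longer sequence is mapped back into the position range seen in training.
    """
    scaled_position = position / scale
    return [scaled_position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


# With scale = 4, position 8192 receives exactly the angles that position
# 2048 would receive without scaling.
unscaled = rope_angles(2048, dim=8)
scaled = rope_angles(8192, dim=8, scale=4.0)
print(unscaled == scaled)
```

Under these assumptions, a model trained on 2048 positions sees no angle larger than it saw in training when reading up to 8192 tokens, at the cost of finer spacing between adjacent positions.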

arXiv:2211.15937 [pdf, other]

Robustness Disparities in Face Detection

Authors: Samuel Dooley, George Z. Wei, Tom Goldstein, John P. Dickerson

Abstract: Facial analysis systems have been deployed by large companies and critiqued by scholars and activists for the past decade. Many existing algorithmic audits examine the performance of these systems on later stage elements of facial analysis systems like facial recognition and age, emotion, or perceived gender prediction; however, a core component to these systems has been vastly understudied from a fairness perspective: face detection, sometimes called face localization. Since face detection is a pre-requisite step in facial analysis systems, the bias we observe in face detection will flow downstream to the other components like facial recognition and emotion prediction. Additionally, no prior work has focused on the robustness of these systems under various perturbations and corruptions, which leaves open the question of how various people are impacted by these phenomena. We present the first of its kind detailed benchmark of face detection systems, specifically examining the robustness to noise of commercial and academic models. We use both standard and recently released academic facial datasets to quantitatively analyze trends in face detection robustness. Across all the datasets and systems, we generally find that photos of individuals who are masculine presenting, older, of darker skin type, or have dim lighting are more susceptible to errors than their counterparts in other identities.

Submitted 29 November, 2022; originally announced November 2022.

Comments: NeurIPS Datasets & Benchmarks Track 2022

arXiv:2211.03554 [pdf, other]

How Technology Impacts and Compares to Humans in Socially Consequential Arenas

Authors: Samuel Dooley

Abstract: One of the main promises of technology development is for it to be adopted by people, organizations, societies, and governments -- incorporated into their life, work stream, or processes. Often, this is socially beneficial as it automates mundane tasks, frees up more time for other more important things, or otherwise improves the lives of those who use the technology. However, these beneficial results do not apply in every scenario and may not impact everyone in a system the same way. Sometimes a technology is developed which both produces benefits and inflicts some harm. These harms may come at a higher cost to some people than others, raising the question: how are benefits and harms weighed when deciding if and how a socially consequential technology gets developed? The most natural way to answer this question, and in fact how people first approach it, is to compare the new technology to what used to exist. As such, in this work, I make comparative analyses between humans and machines in three scenarios and seek to understand how sentiment about a technology, performance of that technology, and the impacts of that technology combine to influence how one decides to answer my main research question.

Submitted 2 November, 2022; originally announced November 2022.

Comments: Doctoral thesis proposal. arXiv admin note: substantial text overlap with arXiv:2110.08396, arXiv:2108.12508, arXiv:2006.12621

arXiv:2210.09943 [pdf, other]

Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition

Authors: Samuel Dooley, Rhea Sanjay Sukthanker, John P. Dickerson, Colin White, Frank Hutter, Micah Goldblum

Abstract: Face recognition systems are widely deployed in safety-critical applications, including law enforcement, yet they exhibit bias across a range of socio-demographic dimensions, such as gender and race. Conventional wisdom dictates that model biases arise from biased training data. As a consequence, previous works on bias mitigation largely focused on pre-processing the training data, adding penalties to prevent bias from affecting the model during training, or post-processing predictions to debias them, yet these approaches have shown limited success on hard problems such as face recognition. In our work, we discover that biases are actually inherent to neural network architectures themselves. Following this reframing, we conduct the first neural architecture search for fairness, jointly with a search for hyperparameters. Our search outputs a suite of models which Pareto-dominate all other high-performance architectures and existing bias mitigation methods in terms of accuracy and fairness, often by large margins, on the two most widely used datasets for face identification, CelebA and VGGFace2. Furthermore, these models generalize to other datasets and sensitive attributes. We release our code, models and raw data files at https://github.com/dooleys/FR-NAS.

Submitted 6 December, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

arXiv:2203.08235 [pdf, other]

A Deep Dive into Dataset Imbalance and Bias in Face Identification

Authors: Valeriia Cherepanova, Steven Reich, Samuel Dooley, Hossein Souri, Micah Goldblum, Tom Goldstein

Abstract: As the deployment of automated face recognition (FR) systems proliferates, bias in these systems is not just an academic question, but a matter of public concern. Media portrayals often center imbalance as the main source of bias, i.e., that FR models perform worse on images of non-white people or women because these demographic groups are underrepresented in training data. Recent academic research paints a more nuanced picture of this relationship. However, previous studies of data imbalance in FR have focused exclusively on the face verification setting, while the face identification setting has been largely ignored, despite being deployed in sensitive applications such as law enforcement. This is an unfortunate omission, as 'imbalance' is a more complex matter in identification; imbalance may arise in not only the training data, but also the testing data, and furthermore may affect the proportion of identities belonging to each demographic group or the number of images belonging to each identity. In this work, we address this gap in the research by thoroughly exploring the effects of each kind of imbalance possible in face identification, and discuss other factors which may impact bias in this setting.

Submitted 15 March, 2022; originally announced March 2022.

arXiv:2203.00565 [pdf, other]

Topological Data Analysis for Word Sense Disambiguation

Authors: Michael Rawson, Samuel Dooley, Mithun Bharadwaj, Rishabh Choudhary

Abstract: We develop and test a novel unsupervised algorithm for word sense induction and disambiguation which uses topological data analysis. Typical approaches to the problem involve clustering, based on simple low level features of distance in word embeddings. Our approach relies on advanced mathematical concepts in the field of topology which provide a richer conceptualization of clusters for the word sense induction tasks. We use a persistent homology barcode algorithm on the SemCor dataset and demonstrate that our approach gives low relative error on word sense induction. This shows the promise of topological algorithms for natural language processing and we advocate for future work in this promising area.

Submitted 1 March, 2022; originally announced March 2022.

arXiv:2202.11095 [pdf, other]

The Dichotomous Affiliate Stable Matching Problem: Approval-Based Matching with Applicant-Employer Relations

Authors: Marina Knittel, Samuel Dooley, John P. Dickerson

Abstract: While the stable marriage problem and its variants model a vast range of matching markets, they fail to capture complex agent relationships, such as the affiliation of applicants and employers in an interview marketplace. To model this problem, the existing literature on matching with externalities permits agents to provide complete and total rankings over matchings based on both their own and their affiliates' matches. This complete ordering restriction is unrealistic, and further the model may have an empty core. To address this, we introduce the Dichotomous Affiliate Stable Matching (DASM) Problem, where agents' preferences indicate dichotomous acceptance or rejection of another agent in the marketplace, both for themselves and their affiliates. We also assume the agent's preferences over entire matchings are determined by a general weighted valuation function of their (and their affiliates') matches. Our results are threefold: (1) we use a human study to show that real-world matching rankings follow our assumed valuation function; (2) we prove that there always exists a stable solution by providing an efficient, easily-implementable algorithm that finds such a solution; and (3) we experimentally validate the efficiency of our algorithm versus a linear-programming-based approach.

Submitted 22 February, 2022; originally announced February 2022.

Comments: 19 pages, 2 figures

arXiv:2202.10194 [pdf]

doi 10.1016/j.proci.2022.07.181

Low-Dimensional High-Fidelity Kinetic Models for NOX Formation by a Compute Intensification Method

Authors: Mark Kelly, Harry Dunne, Gilles Bourque, Stephen Dooley

Abstract: A novel compute intensification methodology to the construction of low-dimensional, high-fidelity "compact" kinetic models for NOX formation is designed and demonstrated. The method adapts the data intensive Machine Learned Optimization of Chemical Kinetics (MLOCK) algorithm for compact model generation by the use of a Latin Square method for virtual reaction network generation. A set of logical rules are defined which construct a minimally sized virtual reaction network comprising three additional nodes (N, NO, NO2). This NOX virtual reaction network is appended to a pre-existing compact model for methane combustion comprising fifteen nodes. The resulting eighteen node virtual reaction network is processed by the MLOCK coded algorithm to produce a plethora of compact model candidates for NOX formation during methane combustion. MLOCK automatically: populates the terms of the virtual reaction network with candidate inputs; measures the success of the resulting compact model candidates (in reproducing a broad set of gas turbine industry-defined performance targets); selects regions of input parameters space showing models of best performance; refines the input parameters to give better performance; and makes an ultimate selection of the best performing model or models. By this method, it is shown that a number of compact model candidates exist that show fidelities in excess of 75% in reproducing industry defined performance targets, with one model valid to >75% across fuel/air equivalence ratios of 0.5-1.0. However, to meet the full fuel/air equivalence ratio performance envelope defined by industry, we show that with this minimal virtual reaction network, two further compact models are required.

Submitted 21 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:2202.08021

arXiv:2202.08021 [pdf]

doi 10.1016/j.combustflame.2023.112755

Toward Development of Machine Learned Techniques for Production of Compact Kinetic Models

Authors: Mark Kelly, Mark Fortune, Gilles Bourque, Stephen Dooley

Abstract: Chemical kinetic models are an essential component in the development and optimisation of combustion devices through their coupling to multi-dimensional simulations such as computational fluid dynamics (CFD). Low-dimensional kinetic models which retain good fidelity to the reality are needed, the production of which requires considerable human-time cost and expert knowledge. Here, we present a novel automated compute intensification methodology to produce overly-reduced and optimised (compact) chemical kinetic models. This algorithm, termed Machine Learned Optimisation of Chemical Kinetics (MLOCK), systematically perturbs each of the four sub-models of a chemical kinetic model to discover what combinations of terms result in a good model. A virtual reaction network comprised of n species is first obtained using conventional mechanism reduction. To counteract the imposed decrease in model performance, the weights (virtual reaction rate constants) of important connections (virtual reactions) between each node (species) of the virtual reaction network are numerically optimised to replicate selected calculations across four sequential phases. The first version of MLOCK (MLOCK1.0) simultaneously perturbs all three virtual Arrhenius reaction rate constant parameters for important connections and assesses the suitability of the new parameters through objective error functions, which quantify the error in each compact model candidate's calculation of the optimisation targets, which may be comprised of detailed model calculations and/or experimental data. MLOCK1.0 is demonstrated by creating compact models for the archetypal case of methane air combustion. It is shown that the NUGMECH1.0 detailed model comprised of 2,789 species is reliably compacted to 15 species (nodes), whilst retaining an overall fidelity of ~87% to the detailed model calculations, outperforming the prior state-of-art.

Submitted 16 February, 2022; originally announced February 2022.

arXiv:2201.10047

Are Commercial Face Detection Models as Biased as Academic Models?

Authors: Samuel Dooley, George Z. Wei, Tom Goldstein, John P. Dickerson

Abstract: As facial recognition systems are deployed more widely, scholars and activists have studied their biases and harms. Audits are commonly used to accomplish this and compare the algorithmic facial recognition systems' performance against datasets with various metadata labels about the subjects of the images. Seminal works have found discrepancies in performance by gender expression, age, perceived race, skin type, etc. These studies and audits often examine algorithms which fall into two categories: academic models or commercial models. We present a detailed comparison between academic and commercial face detection systems, specifically examining robustness to noise. We find that state-of-the-art academic face detection models exhibit demographic disparities in their noise robustness, specifically by having statistically significant decreased performance on older individuals and those who present their gender in a masculine manner. When we compare the size of these disparities to that of commercial models, we conclude that commercial models - in contrast to their relatively larger development budget and industry-level fairness commitments - are always as biased or more biased than an academic model.

Submitted 29 November, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: This preprint and arXiv:2108.12508 were combined and a more rigorous analysis added to result in the NeurIPS Datasets & Benchmark 2022 paper arXiv:2211.15937

arXiv:2110.09437 [pdf, other]

Ctrl-Shift: How Privacy Sentiment Changed from 2019 to 2021

Authors: Angelica Goetzen, Samuel Dooley, Elissa M. Redmiles

Abstract: People's privacy sentiments influence changes in legislation as well as technology design and use. While single-point-in-time investigations of privacy sentiment offer useful insight, study of people's privacy sentiments over time is also necessary to better understand and anticipate evolving privacy attitudes. In this work, we use repeated cross-sectional surveys (n=6,676) to model the sentiments of people in the U.S. toward collection and use of data for government- and health-related purposes from 2019-2021. After the onset of COVID-19, we observe significant decreases in respondent acceptance of government data use and significant increases in acceptance of health-related data uses. While differences in privacy attitudes between sociodemographic groups largely decreased over this time period, following the 2020 U.S. national elections, we observe some of the first evidence that privacy sentiments may change based on the alignment between a user's politics and the political party in power. Our results offer insight into how privacy attitudes may have been impacted by recent events and allow us to identify potential predictors of changes in privacy attitudes during times of geopolitical or national change.

Submitted 15 March, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

arXiv:2110.08396 [pdf, other]

Comparing Human and Machine Bias in Face Recognition

Authors: Samuel Dooley, Ryan Downing, George Wei, Nathan Shankar, Bradon Thymes, Gudrun Thorkelsdottir, Tiye Kurtz-Miott, Rachel Mattson, Olufemi Obiwumi, Valeriia Cherepanova, Micah Goldblum, John P Dickerson, Tom Goldstein

Abstract: Much recent research has uncovered and discussed serious concerns of bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias but have two major challenges: the audits (1) use facial recognition datasets which lack quality metadata, like LFW and CelebA, and (2) do not compare their observed algorithmic bias to the biases of their human alternatives. In this paper, we release improvements to the LFW and CelebA datasets which will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g. identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions that we administered to various algorithms and a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their demographics match that of the question. Computer models are observed to achieve a higher level of accuracy than the survey participants on both tasks and exhibit bias to similar degrees as the human survey participants.

Submitted 25 October, 2021; v1 submitted 15 October, 2021; originally announced October 2021.

arXiv:2108.12508 [pdf, other]

Robustness Disparities in Commercial Face Detection

Authors: Samuel Dooley, Tom Goldstein, John P. Dickerson

Abstract: Facial detection and analysis systems have been deployed by large companies and critiqued by scholars and activists for the past decade. Critiques that focus on system performance analyze disparity of the system's output, i.e., how frequently is a face detected for different Fitzpatrick skin types or perceived genders. However, we focus on the robustness of these system outputs under noisy natural perturbations. We present the first of its kind detailed benchmark of the robustness of three such systems: Amazon Rekognition, Microsoft Azure, and Google Cloud Platform. We use both standard and recently released academic facial datasets to quantitatively analyze trends in robustness for each. Across all the datasets and systems, we generally find that photos of individuals who are older, masculine presenting, of darker skin type, or have dim lighting are more susceptible to errors than their counterparts in other identities.

Submitted 27 August, 2021; originally announced August 2021.

arXiv:2106.03215 [pdf, other]

PreferenceNet: Encoding Human Preferences in Auction Design with Deep Learning

Authors: Neehar Peri, Michael J. Curry, Samuel Dooley, John P. Dickerson

Abstract: The design of optimal auctions is a problem of interest in economics, game theory and computer science. Despite decades of effort, strategyproof, revenue-maximizing auction designs are still not known outside of restricted settings. However, recent methods using deep learning have shown some success in approximating optimal auctions, recovering several known solutions and outperforming strong baselines when optimal auctions are not known. In addition to maximizing revenue, auction mechanisms may also seek to encourage socially desirable constraints such as allocation fairness or diversity. However, these philosophical notions neither have standardization nor do they have widely accepted formal definitions. In this paper, we propose PreferenceNet, an extension of existing neural-network-based auction mechanisms to encode constraints using (potentially human-provided) exemplars of desirable allocations. In addition, we introduce a new metric to evaluate an auction allocations' adherence to such socially desirable constraints and demonstrate that our proposed method is competitive with current state-of-the-art neural-network based auction designs. We validate our approach through human subject research and show that we are able to effectively capture real human preferences. Our code is available at https://github.com/neeharperi/PreferenceNet

Submitted 17 October, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: This work has been accepted to Neural Information Processing Systems (NeurIPS) 2021. First two authors contributed equally

arXiv:2010.06398 [pdf, other]

ProportionNet: Balancing Fairness and Revenue for Auction Design with Deep Learning

Authors: Kevin Kuo, Anthony Ostuni, Elizabeth Horishny, Michael J. Curry, Samuel Dooley, Ping-yeh Chiang, Tom Goldstein, John P. Dickerson

Abstract: The design of revenue-maximizing auctions with strong incentive guarantees is a core concern of economic theory. Computational auctions enable online advertising, sourcing, spectrum allocation, and myriad financial markets. Analytic progress in this space is notoriously difficult; since Myerson's 1981 work characterizing single-item optimal auctions, there has been limited progress outside of restricted settings. A recent paper by Dütting et al. circumvents analytic difficulties by applying deep learning techniques to, instead, approximate optimal auctions. In parallel, new research from Ilvento et al. and other groups has developed notions of fairness in the context of auction design. Inspired by these advances, in this paper, we extend techniques for approximating auctions using deep learning to address concerns of fairness while maintaining high revenue and strong incentive guarantees.

Submitted 13 October, 2020; originally announced October 2020.

arXiv:2009.11867 [pdf, other]

The Affiliate Matching Problem: On Labor Markets where Firms are Also Interested in the Placement of Previous Workers

Authors: Samuel Dooley, John P. Dickerson

Abstract: In many labor markets, workers and firms are connected via affiliative relationships. A management consulting firm wishes to both accept the best new workers but also place its current affiliated workers at strong firms. Similarly, a research university wishes to hire strong job market candidates while also placing its own candidates at strong peer universities. We model this affiliate matching problem in a generalization of the classic stable marriage setting by permitting firms to state preferences over not just which workers to whom they are matched, but also to which firms their affiliated workers are matched. Based on results from a human survey, we find that participants (acting as firms) give preference to their own affiliate workers in surprising ways that violate some assumptions of the classical stable marriage problem. This motivates a nuanced discussion of how stability could be defined in affiliate matching problems; we give an example of a marketplace which admits a stable match under one natural definition of stability, and does not for that same marketplace under a different, but still natural, definition. We conclude by setting a research agenda toward the creation of a centralized clearing mechanism in this general setting.

Submitted 23 September, 2020; originally announced September 2020.

arXiv:2006.12621 [pdf, other]

Fairness Through Robustness: Investigating Robustness Disparity in Deep Learning

Authors: Vedant Nanda, Samuel Dooley, Sahil Singla, Soheil Feizi, John P. Dickerson

Abstract: Deep neural networks (DNNs) are increasingly used in real-world applications (e.g. facial recognition). This has resulted in concerns about the fairness of decisions made by these models. Various notions and measures of fairness have been proposed to ensure that a decision-making system does not disproportionately harm (or benefit) particular subgroups of the population. In this paper, we argue that traditional notions of fairness that are only based on models' outputs are not sufficient when the model is vulnerable to adversarial attacks. We argue that in some cases, it may be easier for an attacker to target a particular subgroup, resulting in a form of robustness bias. We show that measuring robustness bias is a challenging task for DNNs and propose two methods to measure this form of bias. We then conduct an empirical study on state-of-the-art neural networks on commonly used real-world datasets such as CIFAR-10, CIFAR-100, Adience, and UTKFace and show that in almost all cases there are subgroups (in some cases based on sensitive attributes like race, gender, etc) which are less robust and are thus at a disadvantage. We argue that this kind of bias arises due to both the data distribution and the highly complex nature of the learned decision boundary in the case of DNNs, thus making mitigation of such biases a non-trivial task. Our results show that robustness bias is an important criterion to consider while auditing real-world systems that rely on DNNs for decision making.
Code to reproduce all our results can be found here: https://github.com/nvedant07/Fairness-Through-Robustness

Submitted 21 January, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2021

arXiv:2001.09742 [pdf, ps, other]

Can an Algorithm be My Healthcare Proxy?

Authors: Duncan C McElfresh, Samuel Dooley, Yuan Cui, Kendra Griesman, Weiqin Wang, Tyler Will, Neil Sehgal, John P Dickerson

Abstract: Planning for death is not a process in which everyone participates. Yet a lack of planning can have vast impacts on a patient's well-being, the well-being of her family, and the medical community as a whole. Advance Care Planning (ACP) has been a field in the United States for a half-century. Many modern techniques prompting patients to think about end of life (EOL) involve short surveys or questionnaires. Different surveys are targeted to different populations (based off of likely disease progression or cultural factors, for instance), are designed with different intentions, and are administered in different ways. There has been recent work using technology to increase the number of people using advance care planning tools. However, modern techniques from machine learning and artificial intelligence could be employed to make additional changes to the current ACP process. In this paper we will discuss some possible ways in which these tools could be applied. We will discuss possible implications of these applications through vignettes of patient scenarios. We hope that this paper will encourage thought about appropriate applications of artificial intelligence in ACP as well as implementation of AI in order to ensure intentions are honored.

Submitted 7 January, 2020; originally announced January 2020.

Comments: Accepted for a poster presentation at the 4th International Workshop on Health Intelligence (W3PHIAI-20), colocated with AAAI 2020

arXiv:1808.02443 [pdf, other]

Overhead Detection: Beyond 8-bits and RGB

Authors: Eliza Mace, Keith Manville, Monica Barbu-McInnis, Michael Laielli, Matthew Klaric, Samuel Dooley

Abstract: This study uses the challenging and publicly available SpaceNet dataset to establish a performance baseline for a state-of-the-art object detector in satellite imagery. Specifically, we examine how various features of the data affect building detection accuracy with respect to the Intersection over Union metric. We demonstrate that the performance of the R-FCN detection algorithm on imagery with a 1.5 meter ground sample distance and three spectral bands increases by over 32% by using 13-bit data, as opposed to 8-bit data at the same spatial and spectral resolution. We also establish accuracy trends with respect to building size and scene density. Finally, we propose and evaluate multiple methods for integrating additional spectral information into off-the-shelf deep learning architectures. Interestingly, our methods are robust to the choice of spectral bands and we note no significant performance improvement when adding additional bands.

Submitted 7 August, 2018; originally announced August 2018.

Comments: 10 pages, 8 figures, 2 tables

Journal ref: Naval Applications of Machine Learning, February 13, 2018

arXiv:1802.07856 [pdf, other]

xView: Objects in Context in Overhead Imagery

Authors: Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, Brendan McCord

Abstract: We introduce a new large-scale dataset for the advancement of object detection techniques and overhead object detection research. This satellite imagery dataset enables research progress pertaining to four key computer vision frontiers. We utilize a novel process for geospatial category detection and bounding box annotation with three stages of quality control. Our data is collected from WorldView-3 satellites at 0.3m ground sample distance, providing higher resolution imagery than most public satellite imagery datasets. We compare xView to other object detection datasets in both natural and overhead imagery domains and then provide a baseline analysis using the Single Shot MultiBox Detector. xView is one of the largest and most diverse publicly available object-detection datasets to date, with over 1 million objects across 60 classes in over 1,400 km^2 of imagery.

Submitted 21 February, 2018; originally announced February 2018.

Comments: Initial submission

arXiv:1204.4459 [pdf]

An Interference-Aware Virtual Clustering Paradigm for Resource Management in Cognitive Femtocell Networks

Authors: Faisal Tariq, Laurence S. Dooley, Adrian S. Poulton

Abstract: Femtocells represent a promising alternative solution for high quality wireless access in indoor scenarios where conventional cellular system coverage can be poor. Femtocell access points (FAP) are normally randomly deployed by the end user, so only post deployment network planning is possible. Furthermore, this uncoordinated deployment creates the potential for severe interference to co-located femtocells, especially in dense deployments. This paper presents a new femtocell network architecture using a generalized virtual cluster femtocell (GVCF) paradigm, which groups together FAP, which are allocated to the same femtocell gateway (FGW), into logical clusters. This guarantees severely interfering and overlapping femtocells are assigned to different clusters, and since each cluster operates on a different band of frequencies, the corresponding virtual cluster controller only has to manage its own FAP members, so the overall system complexity is low. The performance of the GVCF algorithm is analysed from both a resource availability and cluster number perspective, and a novel strategy is proposed for dynamically adapting these to network environment changes, while upholding quality-of-service requirements. Simulation results conclusively corroborate the superior performance of the GVCF model in interference mitigation, particularly in high density FAP scenarios.

Submitted 19 April, 2012; originally announced April 2012.

Showing 1–28 of 28 results for author: Dooley, S