Search | arXiv e-print repository

Lessons from the Trenches on Reproducible Evaluation of Language Models

Authors: Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan , et al. (5 additional authors not shown)

Abstract: Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons… ▽ More Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. Third, we present the Language Model Evaluation Harness (lm-eval): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns. △ Less

Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.04491 [pdf, other]

doi 10.1017/pasa.2024.32

Galaxy 3D Shape Recovery using Mixture Density Network

Authors: Suk Yee Yong, K. E. Harborne, Caroline Foster, Robert Bassett, Gregory B. Poole, Mitchell Cavanagh

Abstract: Since the turn of the century, astronomers have been exploiting the rich information afforded by combining stellar kinematic maps and imaging in an attempt to recover the intrinsic, three-dimensional (3D) shape of a galaxy. A common intrinsic shape recovery method relies on an expected monotonic relationship between the intrinsic misalignment of the kinematic and morphological axes and the triaxia… ▽ More Since the turn of the century, astronomers have been exploiting the rich information afforded by combining stellar kinematic maps and imaging in an attempt to recover the intrinsic, three-dimensional (3D) shape of a galaxy. A common intrinsic shape recovery method relies on an expected monotonic relationship between the intrinsic misalignment of the kinematic and morphological axes and the triaxiality parameter. Recent studies have, however, cast doubt about underlying assumptions relating shape and intrinsic kinematic misalignment. In this work, we aim to recover the 3D shape of individual galaxies using their projected stellar kinematic and flux distributions using a supervised machine learning approach with mixture density network (MDN). Using a mock dataset of the EAGLE hydrodynamical cosmological simulation, we train the MDN model for a carefully selected set of common kinematic and photometric parameters. Compared to previous methods, we demonstrate potential improvements achieved with the MDN model to retrieve the 3D galaxy shape along with the uncertainties, especially for prolate and triaxial systems. We make specific recommendations for recovering galaxy intrinsic shapes relevant for current and future integral field spectroscopic galaxy surveys. △ Less

Submitted 5 April, 2024; originally announced April 2024.

Comments: Accepted for publication in PASA. 18 pages, 12 figures, 2 tables

Journal ref: Publ. Astron. Soc. Aust. 41 (2024) e033

arXiv:2403.06557 [pdf, other]

Data-driven architecture to encode information in the kinematics of robots and artificial avatars

Authors: Francesco De Lellis, Marco Coraggio, Nathan C. Foster, Riccardo Villa, Cristina Becchio, Mario di Bernardo

Abstract: We present a data-driven control architecture for modifying the kinematics of robots and artificial avatars to encode specific information such as the presence or not of an emotion in the movements of an avatar or robot driven by a human operator. We validate our approach on an experimental dataset obtained during the reach-to-grasp phase of a pick-and-place task. We present a data-driven control architecture for modifying the kinematics of robots and artificial avatars to encode specific information such as the presence or not of an emotion in the movements of an avatar or robot driven by a human operator. We validate our approach on an experimental dataset obtained during the reach-to-grasp phase of a pick-and-place task. △ Less

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.02846 [pdf, other]

Machine Learning Resistant Amorphous Silicon Physically Unclonable Functions (PUFs)

Authors: Velat Kilic, Neil Macfarlane, Jasper Stround, Samuel Metais, Milad Alemohammad, A. Brinton Cooper, Amy C. Foster, Mark A. Foster

Abstract: We investigate usage of nonlinear wave chaotic amorphous silicon (a-Si) cavities as physically unclonable functions (PUF). Machine learning attacks on integrated electronic PUFs have been demonstrated to be very effective at modeling PUF behavior. Such attacks on integrated a-Si photonic PUFs are investigated through application of algorithms including linear regression, k-nearest neighbor, decisi… ▽ More We investigate usage of nonlinear wave chaotic amorphous silicon (a-Si) cavities as physically unclonable functions (PUF). Machine learning attacks on integrated electronic PUFs have been demonstrated to be very effective at modeling PUF behavior. Such attacks on integrated a-Si photonic PUFs are investigated through application of algorithms including linear regression, k-nearest neighbor, decision tree ensembles (random forests and gradient boosted trees), and deep neural networks (DNNs). We found that DNNs performed the best among all the algorithms studied but still failed to completely break the a-Si PUF security which we quantify through a private information metric. Furthermore, machine learning resistance of a-Si PUFs were found to be directly related to the strength of their nonlinear response. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2208.10022 [pdf, other]

Generalized Relative Neighborhood Graph (GRNG) for Similarity Search

Authors: Cole Foster, Berk Sevilmis, Benjamin Kimia

Abstract: Similarity search is a fundamental building block for information retrieval on a variety of datasets. The notion of a neighbor is often based on binary considerations, such as the k nearest neighbors. However, considering that data is often organized as a manifold with low intrinsic dimension, the notion of a neighbor must recognize higher-order relationship, to capture neighbors in all directions… ▽ More Similarity search is a fundamental building block for information retrieval on a variety of datasets. The notion of a neighbor is often based on binary considerations, such as the k nearest neighbors. However, considering that data is often organized as a manifold with low intrinsic dimension, the notion of a neighbor must recognize higher-order relationship, to capture neighbors in all directions. Proximity graphs, such as the Relative Neighbor Graphs (RNG), use trinary relationships which capture the notion of direction and have been successfully used in a number of applications. However, the current algorithms for computing the RNG, despite widespread use, are approximate and not scalable. This paper proposes a novel type of graph, the Generalized Relative Neighborhood Graph (GRNG) for use in a pivot layer that then guides the efficient and exact construction of the RNG of a set of exemplars. It also shows how to extend this to a multi-layer hierarchy which significantly improves over the state-of-the-art methods which can only construct an approximate RNG. △ Less

Submitted 21 August, 2022; originally announced August 2022.

arXiv:2108.09781 [pdf]

Global Transfers: M-Pesa, Intellectual Property Rights and Digital Innovation

Authors: Christopher Foster

Abstract: In July 2020, in the midst of the COVID crisis, the Kenyan mobile operator Safaricom announced that the intellectual property rights (IPR) for mobile money service M-Pesa were "moving back into African control". This paper tracks how the IPR originally came to be held outside Kenya, and the implications for understanding M-Pesa as an inclusive innovation. Through reflection of this analysis of IPR… ▽ More In July 2020, in the midst of the COVID crisis, the Kenyan mobile operator Safaricom announced that the intellectual property rights (IPR) for mobile money service M-Pesa were "moving back into African control". This paper tracks how the IPR originally came to be held outside Kenya, and the implications for understanding M-Pesa as an inclusive innovation. Through reflection of this analysis of IPR and innovation, the paper contributes to discussions on structural aspects of digital innovation in the global south. By focussing on IPR, it unpacks some of the processes by which global intellectual property regimes and cross-border IPR practices shape uneven outcomes and power. △ Less

Submitted 22 August, 2021; originally announced August 2021.

Comments: In proceedings of the 1st Virtual Conference on Implications of Information and Digital Technologies for Development, 2021

arXiv:2108.05025 [pdf, other]

doi 10.1145/3462244.3479923

Learning Oculomotor Behaviors from Scanpath

Authors: Beibin Li, Nicholas Nuechterlein, Erin Barney, Claire Foster, Minah Kim, Monique Mahony, Adham Atyabi, Li Feng, Quan Wang, Pamela Ventola, Linda Shapiro, Frederick Shic

Abstract: Identifying oculomotor behaviors relevant for eye-tracking applications is a critical but often challenging task. Aiming to automatically learn and extract knowledge from existing eye-tracking data, we develop a novel method that creates rich representations of oculomotor scanpaths to facilitate the learning of downstream tasks. The proposed stimulus-agnostic Oculomotor Behavior Framework (OBF) mo… ▽ More Identifying oculomotor behaviors relevant for eye-tracking applications is a critical but often challenging task. Aiming to automatically learn and extract knowledge from existing eye-tracking data, we develop a novel method that creates rich representations of oculomotor scanpaths to facilitate the learning of downstream tasks. The proposed stimulus-agnostic Oculomotor Behavior Framework (OBF) model learns human oculomotor behaviors from unsupervised and semi-supervised tasks, including reconstruction, predictive coding, fixation identification, and contrastive learning tasks. The resultant pre-trained OBF model can be used in a variety of applications. Our pre-trained model outperforms baseline approaches and traditional scanpath methods in autism spectrum disorder and viewed-stimulus classification tasks. Ablation experiments further show our proposed method could achieve even better results with larger model sizes and more diverse eye-tracking training datasets, supporting the model's potential for future eye-tracking applications. Open source code: http://github.com/BeibinLi/OBF. △ Less

Submitted 11 August, 2021; originally announced August 2021.

Comments: Accepted ACM ICMI 2021

arXiv:2101.00027 [pdf, other]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Authors: Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

Abstract: Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and new… ▽ More Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction. △ Less

Submitted 31 December, 2020; originally announced January 2021.

arXiv:1911.11069 [pdf]

Query Expansion for Patent Searching using Word Embedding and Professional Crowdsourcing

Authors: Arthi Krishna, Ye **, Christine Foster, Greg Gabel, Britt Hanley, Abdou Youssef

Abstract: The patent examination process includes a search of previous work to verify that a patent application describes a novel invention. Patent examiners primarily use keyword-based searches to uncover prior art. A critical part of keyword searching is query expansion, which is the process of including alternate terms such as synonyms and other related words, since the same concepts are often described… ▽ More The patent examination process includes a search of previous work to verify that a patent application describes a novel invention. Patent examiners primarily use keyword-based searches to uncover prior art. A critical part of keyword searching is query expansion, which is the process of including alternate terms such as synonyms and other related words, since the same concepts are often described differently in the literature. Patent terminology is often domain specific. By curating technology-specific corpora and training word embedding models based on these corpora, we are able to automatically identify the most relevant expansions of a given word or phrase. We compare the performance of several automated query expansion techniques against expert specified expansions. Furthermore, we explore a novel mechanism to extract related terms not just based on one input term but several terms in conjunction by computing their centroid and identifying the nearest neighbors to this centroid. Highly skilled patent examiners are often the best and most reliable source of identifying related terms. By designing a user interface that allows examiners to interact with the word embedding suggestions, we are able to use these interactions to power crowdsourced modes of related terms. Learning from users allows us to overcome several challenges such as identifying words that are bleeding edge and have not been published in the corpus yet. This paper studies the effectiveness of word embedding and crowdsourced models across 11 disparate technical areas. △ Less

Submitted 14 November, 2019; originally announced November 2019.

Comments: Presented at AAAI FSS-19: Artificial Intelligence in Government and Public Sector, Arlington, Virginia, USA

arXiv:1904.03616 [pdf, other]

A Facial Affect Analysis System for Autism Spectrum Disorder

Authors: Beibin Li, Sachin Mehta, Deepali Aneja, Claire Foster, Pamela Ventola, Frederick Shic, Linda Shapiro

Abstract: In this paper, we introduce an end-to-end machine learning-based system for classifying autism spectrum disorder (ASD) using facial attributes such as expressions, action units, arousal, and valence. Our system classifies ASD using representations of different facial attributes from convolutional neural networks, which are trained on images in the wild. Our experimental results show that different… ▽ More In this paper, we introduce an end-to-end machine learning-based system for classifying autism spectrum disorder (ASD) using facial attributes such as expressions, action units, arousal, and valence. Our system classifies ASD using representations of different facial attributes from convolutional neural networks, which are trained on images in the wild. Our experimental results show that different facial attributes used in our system are statistically significant and improve sensitivity, specificity, and F1 score of ASD classification by a large margin. In particular, the addition of different facial attributes improves the performance of ASD classification by about 7% which achieves a F1 score of 76%. △ Less

Submitted 7 April, 2019; originally announced April 2019.

Comments: 5 pages (including 1 page for reference), 3 figures

arXiv:1711.02222 [pdf]

Information-Dense Nonlinear Photonic Physical Unclonable Function

Authors: Brian C. Grubel, Bryan T. Bosworth, Michael R. Kossey, A. Brinton Cooper, Mark A. Foster, Amy C. Foster

Abstract: We present a comprehensive investigation into the complexity of a new private key storage apparatus: a novel silicon photonic physical unclonable function (PUF) based on ultrafast nonlinear optical interactions in a chaotic silicon microcavity that is both unclonable and impossible to emulate. This device provides remarkable improvements to total information content (raw cryptographic material), i… ▽ More We present a comprehensive investigation into the complexity of a new private key storage apparatus: a novel silicon photonic physical unclonable function (PUF) based on ultrafast nonlinear optical interactions in a chaotic silicon microcavity that is both unclonable and impossible to emulate. This device provides remarkable improvements to total information content (raw cryptographic material), information density, and key generation rates over existing optical scattering PUFs and is also more easily integrated with both CMOS electronics and telecommunications hardware. Our device exploits the natural nonlinear optical behavior of silicon to neutralize commonly used attacks against PUFs and vastly enhance device complexity. We confirm this phenomenon with thorough experimental results on prototype devices and present a detailed estimate of their total information content. Our compact, micron-scale approach represents an entirely new generation of ultrafast and high information density photonic PUF devices that can be directly incorporated into integrated circuits to ensure authenticity and provide secure physical storage of private key material. △ Less

Submitted 6 November, 2017; originally announced November 2017.

arXiv:1711.01439 [pdf]

doi 10.1364/OE.26.004710

Secure Communications using Nonlinear Silicon Photonic Keys

Authors: Brian C. Grubel, Bryan T. Bosworth, Michael R. Kossey, A. Brinton Cooper, Mark A. Foster, Amy C. Foster

Abstract: We present a secure communication system constructed using pairs of nonlinear photonic physical unclonable functions (PUFs) that harness physical chaos in integrated silicon micro-cavities. Compared to a large, electronically stored one-time pad, our method provisions large amounts of information within the intrinsically complex nanostructure of the micro-cavities. By probing a micro-cavity with a… ▽ More We present a secure communication system constructed using pairs of nonlinear photonic physical unclonable functions (PUFs) that harness physical chaos in integrated silicon micro-cavities. Compared to a large, electronically stored one-time pad, our method provisions large amounts of information within the intrinsically complex nanostructure of the micro-cavities. By probing a micro-cavity with a rapid sequence of spectrally-encoded ultrafast optical pulses and measuring the lightwave responses, we experimentally demonstrate the ability to extract 2.4 Gb of key material from a single micro-cavity device. Subsequently, in a secure communications experiment with pairs of devices, we achieve bit error rates below $10^{-5}$ at code rates of up to 0.1. The PUFs' responses are never transmitted over the channel or stored in digital memory, thus enhancing security of the system. Additionally, the micro-cavity PUFs are extremely small, inexpensive, robust, and fully compatible with telecommunications infrastructure, components, and electronic fabrication. This approach can serve one-time pad or public key exchange applications where high security is required △ Less

Submitted 5 February, 2018; v1 submitted 4 November, 2017; originally announced November 2017.

Comments: 12 pages. Replaced with revised version

arXiv:1611.06962 [pdf, other]

Sampled Image Tagging and Retrieval Methods on User Generated Content

Authors: Karl Ni, Kyle Zaragoza, Charles Foster, Carmen Carrano, Barry Chen, Yonas Tesfaye, Alex Gude

Abstract: Traditional image tagging and retrieval algorithms have limited value as a result of being trained with heavily curated datasets. These limitations are most evident when arbitrary search words are used that do not intersect with training set labels. Weak labels from user generated content (UGC) found in the wild (e.g., Google Photos, FlickR, etc.) have an almost unlimited number of unique words in… ▽ More Traditional image tagging and retrieval algorithms have limited value as a result of being trained with heavily curated datasets. These limitations are most evident when arbitrary search words are used that do not intersect with training set labels. Weak labels from user generated content (UGC) found in the wild (e.g., Google Photos, FlickR, etc.) have an almost unlimited number of unique words in the metadata tags. Prior work on word embeddings successfully leveraged unstructured text with large vocabularies, and our proposed method seeks to apply similar cost functions to open source imagery. Specifically, we train a deep learning image tagging and retrieval system on large scale, user generated content (UGC) using sampling methods and joint optimization of word embeddings. By using the Yahoo! FlickR Creative Commons (YFCC100M) dataset, such an approach builds robustness to common unstructured data issues that include but are not limited to irrelevant tags, misspellings, multiple languages, polysemy, and tag imbalance. As a result, the final proposed algorithm will not only yield comparable results to state of the art in conventional image tagging, but will enable new capability to train algorithms on large, scale unstructured text in the YFCC100M dataset and outperform cited work in zero-shot capability. △ Less

Submitted 2 December, 2016; v1 submitted 21 November, 2016; originally announced November 2016.

Showing 1–13 of 13 results for author: Foster, C