
OLMES: A Standard for Language Model Evaluations

Authors: Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, Hannaneh Hajishirzi

Abstract: Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models in particular is challenging, as small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions against larger models that can utilize the original formulation. OLMES includes well-considered recommendations guided by results from existing literature as well as new experiments investigating open questions.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2208.02126 [pdf, other]

Noise tolerance of learning to rank under class-conditional label noise

Authors: Dany Haddad

Abstract: Often, the data used to train ranking models is subject to label noise. For example, in web-search, labels created from clickstream data are noisy due to issues such as insufficient information in item descriptions on the SERP, query reformulation by the user, and erratic or unexpected user behavior. In practice, it is difficult to handle label noise without making strong assumptions about the label generation process. As a result, practitioners typically train their learning-to-rank (LtR) models directly on this noisy data without additional consideration of the label noise. Surprisingly, we often see strong performance from LtR models trained in this way. In this work, we describe a class of noise-tolerant LtR losses for which empirical risk minimization is a consistent procedure, even in the context of class-conditional label noise. We also develop noise-tolerant analogs of commonly used loss functions. The practical implications of our theoretical findings are further supported by experimental results.

Submitted 17 August, 2022; v1 submitted 3 August, 2022; originally announced August 2022.

arXiv:2206.03612 [pdf, other]

Predictive Modeling of Charge Levels for Battery Electric Vehicles using CNN EfficientNet and IGTD Algorithm

Authors: Seongwoo Choi, Chongzhou Fang, David Haddad, Minsung Kim

Abstract: Convolutional Neural Networks (CNN) have been a good solution for understanding a vast image dataset. As the increased number of battery-equipped electric vehicles is flourishing globally, there has been much research on understanding which charge levels electric vehicle drivers would choose to charge their vehicles to get to their destination without any prevention. We implemented deep learning approaches to analyze the tabular datasets to understand their state of charge and which charge levels they would choose. In addition, we implemented the Image Generator for Tabular Dataset algorithm to utilize tabular datasets as image datasets to train convolutional neural networks. Also, we integrated other CNN architecture such as EfficientNet to prove that CNN is a great learner for reading information from images that were converted from the tabular dataset, and able to predict charge levels for battery-equipped electric vehicles. We also evaluated several optimization methods to enhance the learning rate of the models and examined further analysis on improving the model architecture.

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2203.15060 [pdf]

A Deep Learning Technique using a Sequence of Follow Up X-Rays for Disease classification

Authors: Sairamvinay Vijayaraghavan, David Haddad, Shikun Huang, Seongwoo Choi

Abstract: The ability to predict lung and heart based diseases using deep learning techniques is central to many researchers, particularly in the medical field around the world. In this paper, we present a unique outlook of a very familiar problem of disease classification using X-rays. We present a hypothesis that X-rays of patients included with the follow up history of their most recent three chest X-ray images would perform better in disease classification in comparison to one chest X-ray image input using an internal CNN to perform feature extraction. We have discovered that our generic deep learning architecture which we propose for solving this problem performs well with 3 input X ray images provided per sample for each patient. In this paper, we have also established that without additional layers before the output classification, the CNN models will improve the performance of predicting the disease labels for each patient. We have provided our results in ROC curves and AUROC scores. We define a fresh approach of collecting three X-ray images for training deep learning models, which we have concluded has clearly improved the performance of the models. We have shown that ResNet, in general, has a better result than any other CNN model used in the feature extraction phase. With our original approach to data pre-processing, image training, and pre-trained models, we believe that the current research will assist many medical institutions around the world, and this will improve the prediction of patients' symptoms and diagnose them with more accurate cure.

Submitted 28 March, 2022; originally announced March 2022.

Comments: 13 pages

arXiv:2007.12824 [pdf, other]

doi 10.1109/TG.2021.3082909

Multi-Armed Bandits for Minesweeper: Profiting from Exploration-Exploitation Synergy

Authors: Igor Q. Lordeiro, Diego B. Haddad, Douglas O. Cardoso

Abstract: A popular computer puzzle, the game of Minesweeper requires its human players to have a mix of both luck and strategy to succeed. Analyzing these aspects more formally, in our research we assessed the feasibility of a novel methodology based on Reinforcement Learning as an adequate approach to tackle the problem presented by this game. For this purpose we employed Multi-Armed Bandit algorithms which were carefully adapted in order to enable their use to define autonomous computational players, targeting to make the best use of some game peculiarities. After experimental evaluation, results showed that this approach was indeed successful, especially in smaller game boards, such as the standard beginner level. Despite this fact the main contribution of this work is a detailed examination of Minesweeper from a learning perspective, which led to various original insights which are thoroughly discussed.

Submitted 17 June, 2021; v1 submitted 24 July, 2020; originally announced July 2020.

Comments: To be published in IEEE Transactions on Games (ISSN 2475-1510 / 2475-1502)

arXiv:2004.06916 [pdf, other]

doi 10.1186/s13362-020-00098-w

Flattening the curves: on-off lock-down strategies for COVID-19 with an application to Brazil

Authors: L. Tarrataca, C. M. Dias, D. B. Haddad, E. F. Arruda

Abstract: The current COVID-19 pandemic is affecting different countries in different ways. The assortment of reporting techniques alongside other issues, such as underreporting and budgetary constraints, makes predicting the spread and lethality of the virus a challenging task. This work attempts to gain a better understanding of how COVID-19 will affect one of the least studied countries, namely Brazil. Currently, several Brazilian states are in a state of lock-down. However, there is political pressure for this type of measures to be lifted. This work considers the impact that such a termination would have on how the virus evolves locally. This was done by extending the SEIR model with an on / off strategy. Given the simplicity of SEIR we also attempted to gain more insight by developing a neural regressor. We chose to employ features that current clinical studies have pinpointed as having a connection to the lethality of COVID-19. We discuss how this data can be processed in order to obtain a robust assessment.

Submitted 15 April, 2020; originally announced April 2020.

arXiv:1907.08657 [pdf, other]

doi 10.1145/3331184.3331272

Learning More From Less: Towards Strengthening Weak Supervision for Ad-Hoc Retrieval

Authors: Dany Haddad, Joydeep Ghosh

Abstract: The limited availability of ground truth relevance labels has been a major impediment to the application of supervised methods to ad-hoc retrieval. As a result, unsupervised scoring methods, such as BM25, remain strong competitors to deep learning techniques which have brought on dramatic improvements in other domains, such as computer vision and natural language processing. Recent works have shown that it is possible to take advantage of the performance of these unsupervised methods to generate training data for learning-to-rank models. The key limitation to this line of work is the size of the training set required to surpass the performance of the original unsupervised method, which can be as large as $10^{13}$ training examples. Building on these insights, we propose two methods to reduce the amount of training data required. The first method takes inspiration from crowdsourcing, and leverages multiple unsupervised rankers to generate soft, or noise-aware, training labels. The second identifies harmful, or mislabeled, training examples and removes them from the training set. We show that our methods allow us to surpass the performance of the unsupervised baseline with far fewer training examples than previous works.

Submitted 19 July, 2019; originally announced July 2019.

Comments: SIGIR 2019

Showing 1–7 of 7 results for author: Haddad, D