Search | arXiv e-print repository

An Automated SQL Query Grading System Using An Attention-Based Convolutional Neural Network

Authors: Donald R. Schwartz, Pablo Rivas

Abstract: Grading SQL queries can be a time-consuming, tedious and challenging task, especially as the number of student submissions increases. Several systems have been introduced in an attempt to mitigate these challenges, but those systems have their own limitations. This paper describes our novel approach to automating the process of grading SQL queries. Unlike previous approaches, we employ a unique co… ▽ More Grading SQL queries can be a time-consuming, tedious and challenging task, especially as the number of student submissions increases. Several systems have been introduced in an attempt to mitigate these challenges, but those systems have their own limitations. This paper describes our novel approach to automating the process of grading SQL queries. Unlike previous approaches, we employ a unique convolutional neural network architecture that employs a parameter-sharing approach for different machine learning tasks that enables the architecture to induce different knowledge representations of the data to increase its potential for understanding SQL statements. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: 12 pages, 8 figures, paper accepted at "The 18th International Conference on Frontiers in Education: Computer Science and Computer Engineering"

ACM Class: I.2.6; H.2.3; K.3.2

arXiv:2406.06386 [pdf, other]

FPN-IAIA-BL: A Multi-Scale Interpretable Deep Learning Model for Classification of Mass Margins in Digital Mammography

Authors: Julia Yang, Alina Jade Barnett, Jon Donnelly, Satvik Kishore, Jerry Fang, Fides Regina Schwartz, Chaofan Chen, Joseph Y. Lo, Cynthia Rudin

Abstract: Digital mammography is essential to breast cancer detection, and deep learning offers promising tools for faster and more accurate mammogram analysis. In radiology and other high-stakes environments, uninterpretable ("black box") deep learning models are unsuitable and there is a call in these fields to make interpretable models. Recent work in interpretable computer vision provides transparency t… ▽ More Digital mammography is essential to breast cancer detection, and deep learning offers promising tools for faster and more accurate mammogram analysis. In radiology and other high-stakes environments, uninterpretable ("black box") deep learning models are unsuitable and there is a call in these fields to make interpretable models. Recent work in interpretable computer vision provides transparency to these formerly black boxes by utilizing prototypes for case-based explanations, achieving high accuracy in applications including mammography. However, these models struggle with precise feature localization, reasoning on large portions of an image when only a small part is relevant. This paper addresses this gap by proposing a novel multi-scale interpretable deep learning model for mammographic mass margin classification. Our contribution not only offers an interpretable model with reasoning aligned with radiologist practices, but also provides a general architecture for computer vision with user-configurable prototypes from coarse- to fine-grained prototypes. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 8 pages, 6 figures, Accepted for oral presentation at the 2024 CVPR Workshop on Domain adaptation, Explainability, Fairness in AI for Medical Image Analysis (DEF-AI-MIA)

arXiv:2405.06563 [pdf, other]

What Can Natural Language Processing Do for Peer Review?

Authors: Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, Tom Hope, Dirk Hovy, Jonathan K. Kummerfeld, Anne Lauscher, Kevin Leyton-Brown, Sheng Lu, Mausam, Margot Mieskes, Aurélie Névéol, Danish Pruthi, Lizhen Qu, Roy Schwartz, Noah A. Smith, Thamar Solorio, **gyan Wang, Xiaodan Zhu, Anna Rogers, Nihar B. Shah, Iryna Gurevych

Abstract: The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time… ▽ More The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2405.04304 [pdf, other]

Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

Authors: Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

Abstract: Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL)-the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Op… ▽ More Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL)-the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text. △ Less

Submitted 23 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.02743 [pdf, other]

Beyond Performance: Quantifying and Mitigating Label Bias in LLMs

Authors: Yuval Reif, Roy Schwartz

Abstract: Large language models (LLMs) have shown remarkable adaptability to diverse tasks, by leveraging context prompts containing instructions, or minimal input-output examples. However, recent work revealed they also exhibit label bias -- an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplo… ▽ More Large language models (LLMs) have shown remarkable adaptability to diverse tasks, by leveraging context prompts containing instructions, or minimal input-output examples. However, recent work revealed they also exhibit label bias -- an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model's predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, as well as highlights the importance of outcomes-based evaluation metrics, which were not previously used in this regard. We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches for both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability. △ Less

Submitted 4 May, 2024; originally announced May 2024.

Comments: NAACL 2024

arXiv:2404.00725 [pdf, other]

The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

Authors: Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, Yossi Adi

Abstract: It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as ru… ▽ More It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model and selecting one. Our findings reveal that, in a standard unit-test setup, the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs. △ Less

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2401.06104 [pdf, other]

Transformers are Multi-State RNNs

Authors: Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz

Abstract: Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-s… ▽ More Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA). Our experiments with four long range tasks and several LLMs show that TOVA outperforms several baseline compression policies. Particularly, our results are nearly on par with the full model, using in some cases only $\frac{1}{8}$ of the original cache size, which translates to 4.8X higher throughput. Our results shed light on the connection between transformers and RNNs, and help mitigate one of LLMs' most painful computational bottlenecks - the size of their key-value cache. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA △ Less

Submitted 18 June, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: preprint

arXiv:2311.15639 [pdf, other]

On Approximating Cutwidth and Pathwidth

Authors: Nikhil Bansal, Dor Katzelnick, Roy Schwartz

Abstract: We study graph ordering problems with a min-max objective. A classical problem of this type is cutwidth, where given a graph we want to order its vertices such that the number of edges crossing any point is minimized. We give a $ \log^{1+o(1)}(n)$ approximation for the problem, substantially improving upon the previous poly-logarithmic guarantees based on the standard recursive balanced partitioni… ▽ More We study graph ordering problems with a min-max objective. A classical problem of this type is cutwidth, where given a graph we want to order its vertices such that the number of edges crossing any point is minimized. We give a $ \log^{1+o(1)}(n)$ approximation for the problem, substantially improving upon the previous poly-logarithmic guarantees based on the standard recursive balanced partitioning approach of Leighton and Rao (FOCS'88). Our key idea is a new metric decomposition procedure that is suitable for handling min-max objectives, which could be of independent interest. We also use this to show other results, including an improved $ \log^{1+o(1)}(n)$ approximation for computing the pathwidth of a graph. △ Less

Submitted 12 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.02251 [pdf]

The Potential of Wearable Sensors for Assessing Patient Acuity in Intensive Care Unit (ICU)

Authors: Jessica Sena, Mohammad Tahsin Mostafiz, Jiaqing Zhang, Andrea Davidson, Sabyasachi Bandyopadhyay, Ren Yuanfang, Tezcan Ozrazgat-Baslanti, Benjamin Shickel, Tyler Loftus, William Robson Schwartz, Azra Bihorac, Parisa Rashidi

Abstract: Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level… ▽ More Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be potentially improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for develo** an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data was analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA= Sequential Organ Failure Assessment) used as a baseline, particularly regarding the precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients. △ Less

Submitted 3 November, 2023; originally announced November 2023.

arXiv:2311.00400 [pdf, other]

Open-Set Face Recognition with Maximal Entropy and Objectosphere Loss

Authors: Rafael Henrique Vareto, Yu Linghu, Terrance E. Boult, William Robson Schwartz, Manuel Günther

Abstract: Open-set face recognition characterizes a scenario where unknown individuals, unseen during the training and enrollment stages, appear on operation time. This work concentrates on watchlists, an open-set task that is expected to operate at a low False Positive Identification Rate and generally includes only a few enrollment samples per identity. We introduce a compact adapter network that benefits… ▽ More Open-set face recognition characterizes a scenario where unknown individuals, unseen during the training and enrollment stages, appear on operation time. This work concentrates on watchlists, an open-set task that is expected to operate at a low False Positive Identification Rate and generally includes only a few enrollment samples per identity. We introduce a compact adapter network that benefits from additional negative face images when combined with distinct cost functions, such as Objectosphere Loss (OS) and the proposed Maximal Entropy Loss (MEL). MEL modifies the traditional Cross-Entropy loss in favor of increasing the entropy for negative samples and attaches a penalty to known target classes in pursuance of gallery specialization. The proposed approach adopts pre-trained deep neural networks (DNNs) for face recognition as feature extractors. Then, the adapter network takes deep feature representations and acts as a substitute for the output layer of the pre-trained DNN in exchange for an agile domain adaptation. Promising results have been achieved following open-set protocols for three different datasets: LFW, IJB-C, and UCCS as well as state-of-the-art performance when supplementary negative data is properly selected to fine-tune the adapter network. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: Accepted for publication in Image and Vision Computing 2023

arXiv:2310.18877 [pdf, other]

Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition

Authors: Isaac Slaughter, Craig Greenberg, Reva Schwartz, Aylin Caliskan

Abstract: Previous work has established that a person's demographics and speech style affect how well speech processing models perform for them. But where does this bias come from? In this work, we present the Speech Embedding Association Test (SpEAT), a method for detecting bias in one type of model used for many speech tasks: pre-trained models. The SpEAT is inspired by word embedding association tests in… ▽ More Previous work has established that a person's demographics and speech style affect how well speech processing models perform for them. But where does this bias come from? In this work, we present the Speech Embedding Association Test (SpEAT), a method for detecting bias in one type of model used for many speech tasks: pre-trained models. The SpEAT is inspired by word embedding association tests in natural language processing, which quantify intrinsic bias in a model's representations of different concepts, such as race or valence (something's pleasantness or unpleasantness) and capture the extent to which a model trained on large-scale socio-cultural data has learned human-like biases. Using the SpEAT, we test for six types of bias in 16 English speech models (including 4 models also trained on multilingual data), which come from the wav2vec 2.0, HuBERT, WavLM, and Whisper model families. We find that 14 or more models reveal positive valence (pleasantness) associations with abled people over disabled people, with European-Americans over African-Americans, with females over males, with U.S. accented speakers over non-U.S. accented speakers, and with younger people over older people. Beyond establishing that pre-trained speech models contain these biases, we also show that they can have real world effects. We compare biases found in pre-trained models to biases in downstream models adapted to the task of Speech Emotion Recognition (SER) and find that in 66 of the 96 tests performed (69%), the group that is more associated with positive valence as indicated by the SpEAT also tends to be predicted as speaking with higher valence by the downstream model. Our work provides evidence that, like text and image-based models, pre-trained speech based-models frequently learn human-like biases. Our work also shows that bias found in pre-trained models can propagate to the downstream task of SER. △ Less

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2309.05088 [pdf]

Towards Trustworthy Artificial Intelligence for Equitable Global Health

Authors: Hong Qin, Jude Kong, Wandi Ding, Ramneek Ahluwalia, Christo El Morr, Zeynep Engin, Jake Okechukwu Effoduh, Rebecca Hwa, Serena **gchuan Guo, Laleh Seyyed-Kalantari, Sylvia Kiwuwa Muyingo, Candace Makeda Moore, Ravi Parikh, Reva Schwartz, Dongxiao Zhu, Xiaoqian Wang, Yiye Zhang

Abstract: Artificial intelligence (AI) can potentially transform global health, but algorithmic bias can exacerbate social inequities and disparity. Trustworthy AI entails the intentional design to ensure equity and mitigate potential biases. To advance trustworthy AI in global health, we convened a workshop on Fairness in Machine Intelligence for Global Health (FairMI4GH). The event brought together a glob… ▽ More Artificial intelligence (AI) can potentially transform global health, but algorithmic bias can exacerbate social inequities and disparity. Trustworthy AI entails the intentional design to ensure equity and mitigate potential biases. To advance trustworthy AI in global health, we convened a workshop on Fairness in Machine Intelligence for Global Health (FairMI4GH). The event brought together a global mix of experts from various disciplines, community health practitioners, policymakers, and more. Topics covered included managing AI bias in socio-technical systems, AI's potential impacts on global health, and balancing data privacy with transparency. Panel discussions examined the cultural, political, and ethical dimensions of AI in global health. FairMI4GH aimed to stimulate dialogue, facilitate knowledge transfer, and spark innovative solutions. Drawing from NIST's AI Risk Management Framework, it provided suggestions for handling AI risks and biases. The need to mitigate data biases from the research design stage, adopt a human-centered approach, and advocate for AI transparency was recognized. Challenges such as updating legal frameworks, managing cross-border data sharing, and motivating developers to reduce bias were acknowledged. The event emphasized the necessity of diverse viewpoints and multi-dimensional dialogue for creating a fair and ethical AI framework for equitable global health. △ Less

Submitted 10 September, 2023; originally announced September 2023.

Comments: 7 pages

arXiv:2308.12371 [pdf, other]

Open-set Face Recognition with Neural Ensemble, Maximal Entropy Loss and Feature Augmentation

Authors: Rafael Henrique Vareto, Manuel Günther, William Robson Schwartz

Abstract: Open-set face recognition refers to a scenario in which biometric systems have incomplete knowledge of all existing subjects. Therefore, they are expected to prevent face samples of unregistered subjects from being identified as previously enrolled identities. This watchlist context adds an arduous requirement that calls for the dismissal of irrelevant faces by focusing mainly on subjects of inter… ▽ More Open-set face recognition refers to a scenario in which biometric systems have incomplete knowledge of all existing subjects. Therefore, they are expected to prevent face samples of unregistered subjects from being identified as previously enrolled identities. This watchlist context adds an arduous requirement that calls for the dismissal of irrelevant faces by focusing mainly on subjects of interest. As a response, this work introduces a novel method that associates an ensemble of compact neural networks with a margin-based cost function that explores additional samples. Supplementary negative samples can be obtained from external databases or synthetically built at the representation level in training time with a new mix-up feature augmentation approach. Deep neural networks pre-trained on large face datasets serve as the preliminary feature extraction module. We carry out experiments on well-known LFW and IJB-C datasets where results show that the approach is able to boost closed and open-set identification rates. △ Less

Submitted 23 August, 2023; originally announced August 2023.

Journal ref: 36th Conference on Graphics, Patterns and Images (SIBGRAPI 2023)

arXiv:2308.07746 [pdf, other]

A Tight Competitive Ratio for Online Submodular Welfare Maximization

Authors: Amit Ganz, Pranav Nuti, Roy Schwartz

Abstract: In this paper we consider the online Submodular Welfare (SW) problem. In this problem we are given $n$ bidders each equipped with a general (not necessarily monotone) submodular utility and $m$ items that arrive online. The goal is to assign each item, once it arrives, to a bidder or discard it, while maximizing the sum of utilities. When an adversary determines the items' arrival order we present… ▽ More In this paper we consider the online Submodular Welfare (SW) problem. In this problem we are given $n$ bidders each equipped with a general (not necessarily monotone) submodular utility and $m$ items that arrive online. The goal is to assign each item, once it arrives, to a bidder or discard it, while maximizing the sum of utilities. When an adversary determines the items' arrival order we present a simple randomized algorithm that achieves a tight competitive ratio of $\nicefrac{1}{4}$. The algorithm is a specialization of an algorithm due to [Harshaw-Kazemi-Feldman-Karbasi MOR`22], who presented the previously best known competitive ratio of $3-2\sqrt{2}\approx 0.171573 $ to the problem. When the items' arrival order is uniformly random, we present a competitive ratio of $\approx 0.27493$, improving the previously known $\nicefrac{1}{4}$ guarantee. Our approach for the latter result is based on a better analysis of the (offline) Residual Random Greedy (RRG) algorithm of [Buchbinder-Feldman-Naor-Schwartz SODA`14], which we believe might be of independent interest. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.07445 [pdf, other]

doi 10.1109/IJCB48548.2020.9304882

Open-set Face Recognition using Ensembles trained on Clustered Data

Authors: Rafael Henrique Vareto, William Robson Schwartz

Abstract: Open-set face recognition describes a scenario where unknown subjects, unseen during the training stage, appear on test time. Not only it requires methods that accurately identify individuals of interest, but also demands approaches that effectively deal with unfamiliar faces. This work details a scalable open-set face identification approach to galleries composed of hundreds and thousands of subj… ▽ More Open-set face recognition describes a scenario where unknown subjects, unseen during the training stage, appear on test time. Not only it requires methods that accurately identify individuals of interest, but also demands approaches that effectively deal with unfamiliar faces. This work details a scalable open-set face identification approach to galleries composed of hundreds and thousands of subjects. It is composed of clustering and an ensemble of binary learning algorithms that estimates when query face samples belong to the face gallery and then retrieves their correct identity. The approach selects the most suitable gallery subjects and uses the ensemble to improve prediction performance. We carry out experiments on well-known LFW and YTF benchmarks. Results show that competitive performance can be achieved even when targeting scalability. △ Less

Submitted 14 August, 2023; originally announced August 2023.

Comments: [Original paper title: Unconstrained Face Identification using Ensembles trained on Clustered Data] [2020 IEEE International Joint Conference on Biometrics (IJCB)] [https://ieeexplore.ieee.org/document/9304882]

arXiv:2308.03516 [pdf, other]

An Improved Approximation Algorithm for the Max-$3$-Section Problem

Authors: Dor Katzelnick, Aditya Pillai, Roy Schwartz, Mohit Singh

Abstract: We consider the Max-$3$-Section problem, where we are given an undirected graph $ G=(V,E)$ equipped with non-negative edge weights $w :E\rightarrow \mathbb{R}_+$ and the goal is to find a partition of $V$ into three equisized parts while maximizing the total weight of edges crossing between different parts. Max-$3$-Section is closely related to other well-studied graph partitioning problems, e.g.,… ▽ More We consider the Max-$3$-Section problem, where we are given an undirected graph $ G=(V,E)$ equipped with non-negative edge weights $w :E\rightarrow \mathbb{R}_+$ and the goal is to find a partition of $V$ into three equisized parts while maximizing the total weight of edges crossing between different parts. Max-$3$-Section is closely related to other well-studied graph partitioning problems, e.g., Max-$k$-Cut, Max-$3$-Cut, and Max-Bisection. We present a polynomial time algorithm achieving an approximation of $ 0.795$, that improves upon the previous best known approximation of $ 0.673$. The requirement of multiple parts that have equal sizes renders Max-$3$-Section much harder to cope with compared to, e.g., Max-Bisection. We show a new algorithm that combines the existing approach of Lassere hierarchy along with a random cut strategy that suffices to give our result. △ Less

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2307.04532 [pdf, other]

Read, Look or Listen? What's Needed for Solving a Multimodal Dataset

Authors: Netta Madvil, Yonatan Bitton, Roy Schwartz

Abstract: The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship bet… ▽ More The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation and analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models. △ Less

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2306.16900 [pdf, other]

Surveying (Dis)Parities and Concerns of Compute Hungry NLP Research

Authors: Ji-Ung Lee, Haritz Puerto, Betty van Aken, Yuki Arase, Jessica Zosa Forde, Leon Derczynski, Andreas Rücklé, Iryna Gurevych, Roy Schwartz, Emma Strubell, Jesse Dodge

Abstract: Many recent improvements in NLP stem from the development and use of large pre-trained language models (PLMs) with billions of parameters. Large model sizes makes computational cost one of the main limiting factors for training and evaluating such models; and has raised severe concerns about the sustainability, reproducibility, and inclusiveness for researching PLMs. These concerns are often based… ▽ More Many recent improvements in NLP stem from the development and use of large pre-trained language models (PLMs) with billions of parameters. Large model sizes makes computational cost one of the main limiting factors for training and evaluating such models; and has raised severe concerns about the sustainability, reproducibility, and inclusiveness for researching PLMs. These concerns are often based on personal experiences and observations. However, there had not been any large-scale surveys that investigate them. In this work, we provide a first attempt to quantify these concerns regarding three topics, namely, environmental impact, equity, and impact on peer reviewing. By conducting a survey with 312 participants from the NLP community, we capture existing (dis)parities between different and within groups with respect to seniority, academia, and industry; and their impact on the peer reviewing process. For each topic, we provide an analysis and devise recommendations to mitigate found disparities, some of which already successfully implemented. Finally, we discuss additional concerns raised by many participants in free-text responses. △ Less

Submitted 9 November, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

arXiv:2306.06205 [pdf, other]

doi 10.1017/S1351324923000190

Morphosyntactic probing of multilingual BERT models

Authors: Judit Acs, Endre Hamerlik, Roy Schwartz, Noah A. Smith, Andras Kornai

Abstract: We introduce an extensive dataset for multilingual probing of morphological information in language models (247 tasks across 42 languages from 10 families), each consisting of a sentence with a target word and a morphological tag as the desired label, derived from the Universal Dependencies treebanks. We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain st… ▽ More We introduce an extensive dataset for multilingual probing of morphological information in language models (247 tasks across 42 languages from 10 families), each consisting of a sentence with a target word and a morphological tag as the desired label, derived from the Universal Dependencies treebanks. We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain strong performance across these tasks. We then apply two methods to locate, for each probing task, where the disambiguating information resides in the input. The first is a new perturbation method that masks various parts of context; the second is the classical method of Shapley values. The most intriguing finding that emerges is a strong tendency for the preceding context to hold more information relevant to the prediction than the following context. △ Less

Submitted 9 June, 2023; originally announced June 2023.

Comments: to appear in the Journal of Natural Language Engineering

arXiv:2306.02307 [pdf, other]

Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings

Authors: Daniel Rotem, Michael Hassid, Jonathan Mamou, Roy Schwartz

Abstract: Adaptive inference is a simple method for reducing inference costs. The method works by maintaining multiple classifiers of different capacities, and allocating resources to each test instance according to its difficulty. In this work, we compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited. First, we observe that for models with the sam… ▽ More Adaptive inference is a simple method for reducing inference costs. The method works by maintaining multiple classifiers of different capacities, and allocating resources to each test instance according to its difficulty. In this work, we compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited. First, we observe that for models with the same architecture and size, individual Multi-Model classifiers outperform their Early-Exit counterparts by an average of 2.3%. We show that this gap is caused by Early-Exit classifiers sharing model parameters during training, resulting in conflicting gradient updates of model weights. We find that despite this gap, Early-Exit still provides a better speed-accuracy trade-off due to the overhead of the Multi-Model approach. To address these issues, we propose SWEET (Separating Weights in Early Exit Transformers), an Early-Exit fine-tuning method that assigns each classifier its own set of unique model weights, not updated by other classifiers. We compare SWEET's speed-accuracy curve to standard Early-Exit and Multi-Model baselines and find that it outperforms both methods at fast speeds while maintaining comparable scores to Early-Exit at slow speeds. Moreover, SWEET individual classifiers outperform Early-Exit ones by 1.1% on average. SWEET enjoys the benefits of both methods, paving the way for further reduction of inference costs in NLP. △ Less

Submitted 4 June, 2023; originally announced June 2023.

Comments: Proceedings of ACL 2023

arXiv:2305.18917 [pdf, other]

Fighting Bias with Bias: Promoting Model Robustness by Amplifying Dataset Biases

Authors: Yuval Reif, Roy Schwartz

Abstract: NLP models often rely on superficial cues known as dataset biases to achieve impressive performance, and can fail on examples where these biases do not hold. Recent work sought to develop robust, unbiased models by filtering biased examples from training sets. In this work, we argue that such filtering can obscure the true capabilities of models to overcome biases, which might never be removed in… ▽ More NLP models often rely on superficial cues known as dataset biases to achieve impressive performance, and can fail on examples where these biases do not hold. Recent work sought to develop robust, unbiased models by filtering biased examples from training sets. In this work, we argue that such filtering can obscure the true capabilities of models to overcome biases, which might never be removed in full from the dataset. We suggest that in order to drive the development of models robust to subtle biases, dataset biases should be amplified in the training set. We introduce an evaluation framework defined by a bias-amplified training set and an anti-biased test set, both automatically extracted from existing datasets. Experiments across three notions of bias, four datasets and two models show that our framework is substantially more challenging for models than the original data splits, and even more challenging than hand-crafted challenge sets. Our evaluation framework can use any existing dataset, even those considered obsolete, to test model robustness. We hope our work will guide the development of robust models that do not rely on superficial biases and correlations. To this end, we publicly release our code and data. △ Less

Submitted 30 May, 2023; originally announced May 2023.

Comments: Findings of ACL 2023

arXiv:2305.17313 [pdf, other]

doi 10.1016/j.cag.2023.05.005

Super-Resolution of License Plate Images Using Attention Modules and Sub-Pixel Convolution Layers

Authors: Valfride Nascimento, Rayson Laroca, Jorge de A. Lambert, William Robson Schwartz, David Menotti

Abstract: Recent years have seen significant developments in the field of License Plate Recognition (LPR) through the integration of deep learning techniques and the increasing availability of training data. Nevertheless, reconstructing license plates (LPs) from low-resolution (LR) surveillance footage remains challenging. To address this issue, we introduce a Single-Image Super-Resolution (SISR) approach t… ▽ More Recent years have seen significant developments in the field of License Plate Recognition (LPR) through the integration of deep learning techniques and the increasing availability of training data. Nevertheless, reconstructing license plates (LPs) from low-resolution (LR) surveillance footage remains challenging. To address this issue, we introduce a Single-Image Super-Resolution (SISR) approach that integrates attention and transformer modules to enhance the detection of structural and textural features in LR images. Our approach incorporates sub-pixel convolution layers (also known as PixelShuffle) and a loss function that uses an Optical Character Recognition (OCR) model for feature extraction. We trained the proposed architecture on synthetic images created by applying heavy Gaussian noise to high-resolution LP images from two public datasets, followed by bicubic downsampling. As a result, the generated images have a Structural Similarity Index Measure (SSIM) of less than 0.10. Our results show that our approach for reconstructing these low-resolution synthesized images outperforms existing ones in both quantitative and qualitative measures. Our code is publicly available at https://github.com/valfride/lpr-rsr-ext/ △ Less

Submitted 26 May, 2023; originally announced May 2023.

Journal ref: Computers & Graphics, vol. 113, pp. 69-76, 2023

arXiv:2305.13009 [pdf, other]

Textually Pretrained Speech Language Models

Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

Abstract: Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de… ▽ More Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ . △ Less

Submitted 30 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023

arXiv:2303.07274 [pdf, other]

Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Authors: Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz

Abstract: Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconvent… ▽ More Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io △ Less

Submitted 12 August, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted to ICCV 2023. Website: whoops-benchmark.github.io

arXiv:2212.04542 [pdf, other]

VASR: Visual Analogies of Situation Recognition

Authors: Yonatan Bitton, Ron Yosef, Eli Strugo, Dafna Shahaf, Roy Schwartz, Gabriel Stanovsky

Abstract: A core process in human cognition is analogical map**: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to wha… ▽ More A core process in human cognition is analogical map**: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820 validated analogies. Our experiments demonstrate that state-of-the-art models do well when distractors are chosen randomly (~86%), but struggle with carefully chosen distractors (~53%, compared to 90% human accuracy). We hope our dataset will encourage the development of new analogy-making models. Website: https://vasr-dataset.github.io/ △ Less

Submitted 8 December, 2022; originally announced December 2022.

Comments: Accepted to AAAI 2023. Website: https://vasr-dataset.github.io/

arXiv:2211.03495 [pdf, other]

How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

Authors: Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz

Abstract: The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with… ▽ More The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: Findings of EMNLP 2022

arXiv:2210.16836 [pdf, other]

doi 10.1109/SIBGRAPI55357.2022.9991753

Combining Attention Module and Pixel Shuffle for License Plate Super-Resolution

Authors: Valfride Nascimento, Rayson Laroca, Jorge de A. Lambert, William Robson Schwartz, David Menotti

Abstract: The License Plate Recognition (LPR) field has made impressive advances in the last decade due to novel deep learning approaches combined with the increased availability of training data. However, it still has some open issues, especially when the data come from low-resolution (LR) and low-quality images/videos, as in surveillance systems. This work focuses on license plate (LP) reconstruction in L… ▽ More The License Plate Recognition (LPR) field has made impressive advances in the last decade due to novel deep learning approaches combined with the increased availability of training data. However, it still has some open issues, especially when the data come from low-resolution (LR) and low-quality images/videos, as in surveillance systems. This work focuses on license plate (LP) reconstruction in LR and low-quality images. We present a Single-Image Super-Resolution (SISR) approach that extends the attention/transformer module concept by exploiting the capabilities of PixelShuffle layers and that has an improved loss function based on LPR predictions. For training the proposed architecture, we use synthetic images generated by applying heavy Gaussian noise in terms of Structural Similarity Index Measure (SSIM) to the original high-resolution (HR) images. In our experiments, the proposed method outperformed the baselines both quantitatively and qualitatively. The datasets we created for this work are publicly available to the research community at https://github.com/valfride/lpr-rsr/ △ Less

Submitted 30 October, 2022; originally announced October 2022.

Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2022

arXiv:2209.00099 [pdf, other]

Efficient Methods for Natural Language Processing: A Survey

Authors: Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz

Abstract: Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require few… ▽ More Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for develo** more efficient methods. △ Less

Submitted 24 March, 2023; v1 submitted 31 August, 2022; originally announced September 2022.

Comments: Accepted at TACL, pre publication version

arXiv:2207.12576 [pdf, other]

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Authors: Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, Roy Schwartz

Abstract: While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a… ▽ More While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, allowing future data collection that can be used to develop models with better association abilities. △ Less

Submitted 11 October, 2022; v1 submitted 25 July, 2022; originally announced July 2022.

Comments: Accepted to NeurIPS 2022, Datasets and Benchmarks. Website: https://winogavil.github.io/

arXiv:2206.09860 [pdf, other]

Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias

Authors: Yarden Tal, Inbal Magar, Roy Schwartz

Abstract: The size of pretrained models is increasing, and so is their performance on a variety of NLP tasks. However, as their memorization capacity grows, they might pick up more social biases. In this work, we examine the connection between model size and its gender bias (specifically, occupational gender bias). We measure bias in three masked language model families (RoBERTa, DeBERTa, and T5) in two set… ▽ More The size of pretrained models is increasing, and so is their performance on a variety of NLP tasks. However, as their memorization capacity grows, they might pick up more social biases. In this work, we examine the connection between model size and its gender bias (specifically, occupational gender bias). We measure bias in three masked language model families (RoBERTa, DeBERTa, and T5) in two setups: directly using prompt based method, and using a downstream task (Winogender). We find on the one hand that larger models receive higher bias scores on the former task, but when evaluated on the latter, they make fewer gender errors. To examine these potentially conflicting results, we carefully investigate the behavior of the different models on Winogender. We find that while larger models outperform smaller ones, the probability that their mistakes are caused by gender bias is higher. Moreover, we find that the proportion of stereotypical errors compared to anti-stereotypical ones grows with the model size. Our findings highlight the potential risks that can arise from increasing model size. △ Less

Submitted 20 June, 2022; originally announced June 2022.

arXiv:2206.05229 [pdf, other]

Measuring the Carbon Intensity of AI in Cloud Instances

Authors: Jesse Dodge, Taylor Prewitt, Remi Tachet Des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A. Smith, Nicole DeCario, Will Buchanan

Abstract: By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, the computational demands of which incur a high energy cost and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access… ▽ More By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, the computational demands of which incur a high energy cost and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access to measurements of this information, precluding development of actionable tactics. Cloud providers presenting information about software carbon intensity to users is a fundamental step** stone towards minimizing emissions. In this paper, we provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions by using location-based and time-specific marginal emissions data per energy unit. We provide measurements of operational software carbon intensity for a set of modern models for natural language processing and computer vision, and a wide range of model sizes, including pretraining of a 6.1 billion parameter language model. We then evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform: using cloud instances in different geographic regions, using cloud instances at different times of day, and dynamically pausing cloud instances when the marginal carbon intensity is above a certain threshold. We confirm previous results that the geographic region of the data center plays a significant role in the carbon intensity for a given cloud instance, and find that choosing an appropriate region can have the largest operational emissions reduction impact. We also show that the time of day has notable impact on operational software carbon intensity. Finally, we conclude with recommendations for how machine learning practitioners can use software carbon intensity information to reduce environmental impact. △ Less

Submitted 10 June, 2022; originally announced June 2022.

Comments: In ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2022

arXiv:2204.12708 [pdf, other]

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

Authors: Roy Schwartz, Gabriel Stanovsky

Abstract: Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to elimi… ▽ More Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups. △ Less

Submitted 27 April, 2022; originally announced April 2022.

Comments: Findings of NAACL 2022

arXiv:2204.06271 [pdf, other]

TangoBERT: Reducing Inference Cost by using Cascaded Architecture

Authors: Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Roy Schwartz

Abstract: The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load in inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient… ▽ More The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load in inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model, and only part of those instances are additionally processed by a less efficient but more accurate second tier model. The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model. Our simple method has several appealing practical advantages compared to standard cascading approaches based on multi-layered transformer models. First, it enables higher speedup gains (average lower latency). Second, it takes advantage of batch size optimization for cascading, which increases the relative inference cost reductions. We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task. Experimental results show that TangoBERT outperforms efficient early exit baseline models; on the the SST-2 task, it achieves an accuracy of 93.9% with a CPU speedup of 8.2x. △ Less

Submitted 13 April, 2022; originally announced April 2022.

arXiv:2204.02406 [pdf]

doi 10.1167/tvst.11.12.3

A deep learning framework for the detection and quantification of drusen and reticular pseudodrusen on optical coherence tomography

Authors: Roy Schwartz, Hagar Khalid, Sandra Liakopoulos, Yanling Ouyang, Coen de Vente, Cristina González-Gonzalo, Aaron Y. Lee, Robyn Guymer, Emily Y. Chew, Catherine Egan, Zhichao Wu, Himeesh Kumar, Joseph Farrington, Clara I. Sánchez, Adnan Tufail

Abstract: Purpose - To develop and validate a deep learning (DL) framework for the detection and quantification of drusen and reticular pseudodrusen (RPD) on optical coherence tomography scans. Design - Development and validation of deep learning models for classification and feature segmentation. Methods - A DL framework was developed consisting of a classification model and an out-of-distribution (OOD… ▽ More Purpose - To develop and validate a deep learning (DL) framework for the detection and quantification of drusen and reticular pseudodrusen (RPD) on optical coherence tomography scans. Design - Development and validation of deep learning models for classification and feature segmentation. Methods - A DL framework was developed consisting of a classification model and an out-of-distribution (OOD) detection model for the identification of ungradable scans; a classification model to identify scans with drusen or RPD; and an image segmentation model to independently segment lesions as RPD or drusen. Data were obtained from 1284 participants in the UK Biobank (UKBB) with a self-reported diagnosis of age-related macular degeneration (AMD) and 250 UKBB controls. Drusen and RPD were manually delineated by five retina specialists. The main outcome measures were sensitivity, specificity, area under the ROC curve (AUC), kappa, accuracy and intraclass correlation coefficient (ICC). Results - The classification models performed strongly at their respective tasks (0.95, 0.93, and 0.99 AUC, respectively, for the ungradable scans classifier, the OOD model, and the drusen and RPD classification model). The mean ICC for drusen and RPD area vs. graders was 0.74 and 0.61, respectively, compared with 0.69 and 0.68 for intergrader agreement. FROC curves showed that the model's sensitivity was close to human performance. Conclusions - The models achieved high classification and segmentation performance, similar to human performance. Application of this robust framework will further our understanding of RPD as a separate entity from drusen in both research and clinical settings. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: 26 pages, 7 figures

arXiv:2203.08242 [pdf, other]

Data Contamination: From Memorization to Exploitation

Authors: Inbal Magar, Roy Schwartz

Abstract: Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the rele… ▽ More Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it. We show that these two measures are affected by different factors such as the number of duplications of the contaminated data and the model size. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation. △ Less

Submitted 15 March, 2022; originally announced March 2022.

Comments: Accepted to ACL 2022

arXiv:2112.12297 [pdf]

High Throughput Multi-Channel Parallelized Diffraction Convolutional Neural Network Accelerator

Authors: Zibo Hu, Shurui Li, Russell L. T. Schwartz, Maria Solyanik-Gorgone, Mario Miscuglio, Puneet Gupta, Volker J. Sorger

Abstract: Convolutional neural networks are paramount in image and signal processing including the relevant classification and training tasks alike and constitute for the majority of machine learning compute demand today. With convolution operations being computationally intensive, next generation hardware accelerators need to offer parallelization and algorithmic-hardware homomorphism. Fortunately, diffrac… ▽ More Convolutional neural networks are paramount in image and signal processing including the relevant classification and training tasks alike and constitute for the majority of machine learning compute demand today. With convolution operations being computationally intensive, next generation hardware accelerators need to offer parallelization and algorithmic-hardware homomorphism. Fortunately, diffractive display optics is capable of million-channel parallel data processing at low latency, however, thus far only showed tens of Hertz slow single image and kernel capability, thereby significantly underdelivering from its performance potential. Here, we demonstrate an operation-parallelized high-throughput Fourier optic convolutional neural network accelerator. For the first time simultaneously processing of multiple kernels in Fourier domain enabled by optical diffraction has been achieved alongside with already conventional in the field input parallelism. Additionally, we show an about one hundred times system speed up over existing optical diffraction-based processors and this demonstration rivals performance of modern electronic solutions. Therefore, this system is capable of processing large-scale matrices about ten times faster than state of art electronic systems. △ Less

Submitted 7 July, 2022; v1 submitted 22 December, 2021; originally announced December 2021.

Comments: 13 pages, 4 figures

arXiv:2110.02488 [pdf, other]

ABC: Attention with Bounded-memory Control

Authors: Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, Noah A. Smith

Abstract: Transformer architectures have achieved state-of-the-art results on a variety of sequence modeling tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size… ▽ More Transformer architectures have achieved state-of-the-art results on a variety of sequence modeling tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve the efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Second, this abstraction gives new insights--an established approach (Wang et al., 2020b) previously thought to be not applicable in causal attention, actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss. △ Less

Submitted 1 June, 2022; v1 submitted 5 October, 2021; originally announced October 2021.

arXiv:2110.00613 [pdf, other]

Expected Validation Performance and Estimation of a Random Variable's Maximum

Authors: Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, Noah A. Smith

Abstract: Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments).… ▽ More Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate three estimators and find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare between different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE. △ Less

Submitted 1 October, 2021; originally announced October 2021.

arXiv:2109.02040 [pdf, other]

Data Efficient Masked Language Modeling for Vision and Language

Authors: Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

Abstract: Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Sec… ▽ More Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In this paper, we observe several key disadvantages of MLM in this setting. First, as captions tend to be short, in a third of the sentences no token is sampled. Second, the majority of masked tokens are stop-words and punctuation, leading to under-utilization of the image. We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings, aiming for better fusion of text and image in the learned representation. When pre-training the LXMERT model, our alternative masking strategies consistently improve over the original masking strategy on three downstream tasks, especially in low resource settings. Further, our pre-training approach substantially outperforms the baseline model on a prompt-based probing task designed to elicit image objects. These results and our analysis indicate that our method allows for better utilization of the training data. △ Less

Submitted 5 September, 2021; originally announced September 2021.

Comments: Accepted to Findings of EMNLP 2021

arXiv:2107.05605 [pdf, other]

Interpretable Mammographic Image Classification using Case-Based Reasoning and Deep Learning

Authors: Alina Jade Barnett, Fides Regina Schwartz, Chaofan Tao, Chaofan Chen, Yinhao Ren, Joseph Y. Lo, Cynthia Rudin

Abstract: When we deploy machine learning models in high-stakes medical settings, we must ensure these models make accurate predictions that are consistent with known medical science. Inherently interpretable networks address this need by explaining the rationale behind each decision while maintaining equal or higher accuracy compared to black-box models. In this work, we present a novel interpretable neura… ▽ More When we deploy machine learning models in high-stakes medical settings, we must ensure these models make accurate predictions that are consistent with known medical science. Inherently interpretable networks address this need by explaining the rationale behind each decision while maintaining equal or higher accuracy compared to black-box models. In this work, we present a novel interpretable neural network algorithm that uses case-based reasoning for mammography. Designed to aid a radiologist in their decisions, our network presents both a prediction of malignancy and an explanation of that prediction using known medical features. In order to yield helpful explanations, the network is designed to mimic the reasoning processes of a radiologist: our network first detects the clinically relevant semantic features of each image by comparing each new image with a learned set of prototypical image parts from the training images, then uses those clinical features to predict malignancy. Compared to other methods, our model detects clinical features (mass margins) with equal or higher accuracy, provides a more detailed explanation of its prediction, and is better able to differentiate the classification-relevant parts of the image. △ Less

Submitted 4 October, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: 10 pages, 6 figures, accepted for oral presentation at the IJCAI-21 Workshop on Deep Learning, Case-Based Reasoning, and AutoML: Present and Future Synergies. arXiv admin note: substantial text overlap with arXiv:2103.12308

ACM Class: I.2.6; I.4.9; I.2.10

arXiv:2106.05939 [pdf, other]

Graph Balancing with Orientation Costs

Authors: Roy Schwartz, Ran Yeheskel

Abstract: Motivated by the classic Generalized Assignment Problem, we consider the Graph Balancing problem in the presence of orientation costs: given an undirected multi-graph G = (V,E) equipped with edge weights and orientation costs on the edges, the goal is to find an orientation of the edges that minimizes both the maximum weight of edges oriented toward any vertex (makespan) and total orientation cost… ▽ More Motivated by the classic Generalized Assignment Problem, we consider the Graph Balancing problem in the presence of orientation costs: given an undirected multi-graph G = (V,E) equipped with edge weights and orientation costs on the edges, the goal is to find an orientation of the edges that minimizes both the maximum weight of edges oriented toward any vertex (makespan) and total orientation cost. We present a general framework for minimizing makespan in the presence of costs that allows us to: (1) achieve bicriteria approximations for the Graph Balancing problem that capture known previous results (Shmoys-Tardos [Math. Progrm. 93], Ebenlendr-Krcál- Sgall [Algorithmica 14], and Wang-Sitters [Inf. Process. Lett. 16]); and (2) achieve bicriteria approximations for extensions of the Graph Balancing problem that admit hyperedges and unrelated weights. Our framework is based on a remarkably simple rounding of a strengthened linear relaxation. We complement the above by presenting bicriteria lower bounds with respect to the linear programming relaxations we use that show that a loss in the total orientation cost is required if one aims for an approximation better than 2 in the makespan. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Journal ref: ESA 2019: 82:1-82:15

arXiv:2105.07238 [pdf]

Using Ethnographic Methods to Classify the Human Experience in Medicine: A Case Study of the Presence Ontology

Authors: Amrapali Maitra, Maulik R. Kamdar, Donna M. Zulman, Marie C. Haverfield, Cati Brown-Johnson, Rachel Schwartz, Sonoo Thadaney Israni, Abraham Verghese, Mark A. Musen

Abstract: Objective Although social and environmental factors are central to provider patient interactions, the data that reflect these factors can be incomplete, vague, and subjective. We sought to create a conceptual framework to describe and classify data about presence, the domain of interpersonal connection in medicine. Methods Our top down approach for ontology development based on the concept of re… ▽ More Objective Although social and environmental factors are central to provider patient interactions, the data that reflect these factors can be incomplete, vague, and subjective. We sought to create a conceptual framework to describe and classify data about presence, the domain of interpersonal connection in medicine. Methods Our top down approach for ontology development based on the concept of relationality included 1) broad survey of social sciences literature and systematic literature review of more than 20,000 articles around interpersonal connection in medicine, 3) relational ethnography of clinical encounters (5 pilot, 27 full) and 4) interviews about relational work with 40 medical and nonmedical professionals. We formalized the model using the Web Ontology Language in the Protege ontology editor. We iteratively evaluated and refined the Presence Ontology through manual expert review and automated annotation of literature. Results and Discussion The Presence Ontology facilitates the naming and classification of concepts that would otherwise be vague. Our model categorizes contributors to healthcare encounters and factors such as Communication, Emotions, Tools, and Environment. Ontology evaluation indicated that Cognitive Models (both patients explanatory models and providers caregiving approaches) influenced encounters and were subsequently incorporated. We show how ethnographic methods based in relationality can aid the representation of experiential concepts (e.g., empathy, trust). Our ontology could support informatics applications to improve healthcare such annotation of videotaped encounters, clinical instruments to measure presence, or EHR based reminders for providers. Conclusion The Presence Ontology provides a model for using ethnographic approaches to classify interpersonal data. △ Less

Submitted 15 May, 2021; originally announced May 2021.

Comments: 15 pages, 4 figures, 57 references

arXiv:2105.07054 [pdf, other]

doi 10.1109/FG47880.2020.00088

Face Attributes as Cues for Deep Face Recognition Understanding

Authors: Matheus Alves Diniz, William Robson Schwartz

Abstract: Deeply learned representations are the state-of-the-art descriptors for face recognition methods. These representations encode latent features that are difficult to explain, compromising the confidence and interpretability of their predictions. Most attempts to explain deep features are visualization techniques that are often open to interpretation. Instead of relying only on visualizations, we us… ▽ More Deeply learned representations are the state-of-the-art descriptors for face recognition methods. These representations encode latent features that are difficult to explain, compromising the confidence and interpretability of their predictions. Most attempts to explain deep features are visualization techniques that are often open to interpretation. Instead of relying only on visualizations, we use the outputs of hidden layers to predict face attributes. The obtained performance is an indicator of how well the attribute is implicitly learned in that layer of the network. Using a variable selection technique, we also analyze how these semantic concepts are distributed inside each layer, establishing the precise location of relevant neurons for each attribute. According to our experiments, gender, eyeglasses and hat usage can be predicted with over 96% accuracy even when only a single neural output is used to predict each attribute. These performances are less than 3 percentage points lower than the ones achieved by deep supervised face attribute networks. In summary, our experiments show that, inside DCNNs optimized for face identification, there exists latent neurons encoding face attributes almost as accurately as DCNNs optimized for these attributes. △ Less

Submitted 14 May, 2021; originally announced May 2021.

Comments: 7 pages, 5 figures, published at automatic face and gesture recognition 2020

arXiv:2105.06967 [pdf, other]

doi 10.1109/IWSSIP48289.2020.9145245

Open-set Face Recognition for Small Galleries Using Siamese Networks

Authors: Gabriel Salomon, Alceu Britto, Rafael H. Vareto, William R. Schwartz, David Menotti

Abstract: Face recognition has been one of the most relevant and explored fields of Biometrics. In real-world applications, face recognition methods usually must deal with scenarios where not all probe individuals were seen during the training phase (open-set scenarios). Therefore, open-set face recognition is a subject of increasing interest as it deals with identifying individuals in a space where not all… ▽ More Face recognition has been one of the most relevant and explored fields of Biometrics. In real-world applications, face recognition methods usually must deal with scenarios where not all probe individuals were seen during the training phase (open-set scenarios). Therefore, open-set face recognition is a subject of increasing interest as it deals with identifying individuals in a space where not all faces are known in advance. This is useful in several applications, such as access authentication, on which only a few individuals that have been previously enrolled in a gallery are allowed. The present work introduces a novel approach towards open-set face recognition focusing on small galleries and in enrollment detection, not identity retrieval. A Siamese Network architecture is proposed to learn a model to detect if a face probe is enrolled in the gallery based on a verification-like approach. Promising results were achieved for small galleries on experiments carried out on Pubfig83, FRGCv1 and LFW datasets. State-of-the-art methods like HFCN and HPLS were outperformed on FRGCv1. Besides, a new evaluation protocol is introduced for experiments in small galleries on LFW. △ Less

Submitted 14 May, 2021; originally announced May 2021.

Journal ref: 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), 2020, pp. 161-166

arXiv:2105.01138 [pdf, other]

Fault Tolerant Max-Cut

Authors: Keren Censor-Hillel, Noa Marelly, Roy Schwartz, Tigran Tonoyan

Abstract: In this work, we initiate the study of fault tolerant Max Cut, where given an edge-weighted undirected graph $G=(V,E)$, the goal is to find a cut $S\subseteq V$ that maximizes the total weight of edges that cross $S$ even after an adversary removes $k$ vertices from $G$. We consider two types of adversaries: an adaptive adversary that sees the outcome of the random coin tosses used by the algorith… ▽ More In this work, we initiate the study of fault tolerant Max Cut, where given an edge-weighted undirected graph $G=(V,E)$, the goal is to find a cut $S\subseteq V$ that maximizes the total weight of edges that cross $S$ even after an adversary removes $k$ vertices from $G$. We consider two types of adversaries: an adaptive adversary that sees the outcome of the random coin tosses used by the algorithm, and an oblivious adversary that does not. For any constant number of failures $k$ we present an approximation of $(0.878-ε)$ against an adaptive adversary and of $α_{GW}\approx 0.8786$ against an oblivious adversary (here $α_{GW}$ is the approximation achieved by the random hyperplane algorithm of [Goemans-Williamson J. ACM `95]). Additionally, we present a hardness of approximation of $α_{GW}$ against both types of adversaries, rendering our results (virtually) tight. The non-linear nature of the fault tolerant objective makes the design and analysis of algorithms harder when compared to the classic Max Cut. Hence, we employ approaches ranging from multi-objective optimization to LP duality and the ellipsoid algorithm to obtain our results. △ Less

Submitted 3 May, 2021; originally announced May 2021.

Comments: 29 pages, 4 figures, conference: ICALP '21

arXiv:2104.11670 [pdf, other]

doi 10.1145/3406325.3451071

The Metric Relaxation for $0$-Extension Admits an $Ω(\log^{2/3}{k})$ Gap

Authors: Roy Schwartz, Nitzan Tur

Abstract: We consider the $0$-Extension problem, where we are given an undirected graph $\mathcal{G}=(V,E)$ equipped with non-negative edge weights $w:E\rightarrow \mathbb{R}^+$, a collection $ T=\{ t_1,\ldots,t_k\}\subseteq V$ of $k$ special vertices called terminals, and a semi-metric $D$ over $T$. The goal is to assign every non-terminal vertex to a terminal while minimizing the sum over all edges of the… ▽ More We consider the $0$-Extension problem, where we are given an undirected graph $\mathcal{G}=(V,E)$ equipped with non-negative edge weights $w:E\rightarrow \mathbb{R}^+$, a collection $ T=\{ t_1,\ldots,t_k\}\subseteq V$ of $k$ special vertices called terminals, and a semi-metric $D$ over $T$. The goal is to assign every non-terminal vertex to a terminal while minimizing the sum over all edges of the weight of the edge multiplied by the distance in $D$ between the terminals to which the endpoints of the edge are assigned. $0$-Extension admits two known algorithms, achieving approximations of $O(\log{k})$ [C{ă}linescu-Karloff-Rabani SICOMP '05] and $O(\log{k}/\log{\log{k}})$ [Fakcharoenphol-Harrelson-Rao-Talwar SODA '03]. Both known algorithms are based on rounding a natural linear programming relaxation called the metric relaxation, in which $D$ is extended from $T$ to the entire of $V$. The current best known integrality gap for the metric relaxation is $Ω(\sqrt{\log{k}})$. In this work we present an improved integrality gap of $Ω(\log^{\frac{2}{3}}k)$ for the metric relaxation. Our construction is based on the randomized extension of one graph by another, a notion that captures lifts of graphs as a special case and might be of independent interest. Inspired by algebraic topology, our analysis of the gap instance is based on proving no continuous section (in the topological sense) exists in the randomized extension. △ Less

Submitted 23 April, 2021; originally announced April 2021.

Comments: 27 pages, 3 figures, will appear in STOC 2021

arXiv:2104.10809 [pdf, other]

Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?

Authors: William Merrill, Yoav Goldberg, Roy Schwartz, Noah A. Smith

Abstract: Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever ``understand'' raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of ``assertions'': textual contexts… ▽ More Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever ``understand'' raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of ``assertions'': textual contexts that provide indirect clues about the underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation of languages that satisfy a strong notion of semantic transparency. However, for classes of languages where the same expression can take different values in different contexts, we show that emulation can become uncomputable. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to ``understand''. △ Less

Submitted 22 June, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

Comments: Updated 06/22/21 with substantive changes. Accepted at TACL; pre-MIT Press publication version

arXiv:2103.12308 [pdf, other]

IAIA-BL: A Case-based Interpretable Deep Learning Model for Classification of Mass Lesions in Digital Mammography

Authors: Alina Jade Barnett, Fides Regina Schwartz, Chaofan Tao, Chaofan Chen, Yinhao Ren, Joseph Y. Lo, Cynthia Rudin

Abstract: Interpretability in machine learning models is important in high-stakes decisions, such as whether to order a biopsy based on a mammographic exam. Mammography poses important challenges that are not present in other computer vision tasks: datasets are small, confounding information is present, and it can be difficult even for a radiologist to decide between watchful waiting and biopsy based on a m… ▽ More Interpretability in machine learning models is important in high-stakes decisions, such as whether to order a biopsy based on a mammographic exam. Mammography poses important challenges that are not present in other computer vision tasks: datasets are small, confounding information is present, and it can be difficult even for a radiologist to decide between watchful waiting and biopsy based on a mammogram alone. In this work, we present a framework for interpretable machine learning-based mammography. In addition to predicting whether a lesion is malignant or benign, our work aims to follow the reasoning processes of radiologists in detecting clinically relevant semantic features of each image, such as the characteristics of the mass margins. The framework includes a novel interpretable neural network algorithm that uses case-based reasoning for mammography. Our algorithm can incorporate a combination of data with whole image labelling and data with pixel-wise annotations, leading to better accuracy and interpretability even with a small number of images. Our interpretable models are able to highlight the classification-relevant parts of the image, whereas other methods highlight healthy tissue and confounding information. Our models are decision aids, rather than decision makers, aimed at better overall human-machine collaboration. We do not observe a loss in mass margin classification accuracy over a black box neural network trained on the same data. △ Less

Submitted 23 March, 2021; originally announced March 2021.

Comments: 24 pages, 5 figures, 2 tables

ACM Class: I.2.6; I.4.9; I.2.10

arXiv:2103.09591 [pdf, other]

Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

Authors: Yonatan Bitton, Gabriel Stanovsky, Roy Schwartz, Michael Elhadad

Abstract: Recent works have shown that supervised models often exploit data artifacts to achieve good test scores while their performance severely degrades on samples outside their training distribution. Contrast sets (Gardneret al., 2020) quantify this phenomenon by perturbing test samples in a minimal way such that the output label is modified. While most contrast sets were created manually, requiring int… ▽ More Recent works have shown that supervised models often exploit data artifacts to achieve good test scores while their performance severely degrades on samples outside their training distribution. Contrast sets (Gardneret al., 2020) quantify this phenomenon by perturbing test samples in a minimal way such that the output label is modified. While most contrast sets were created manually, requiring intensive annotation effort, we present a novel method which leverages rich semantic input representation to automatically generate contrast sets for the visual question answering task. Our method computes the answer of perturbed questions, thus vastly reducing annotation cost and enabling thorough evaluation of models' performance on various semantic aspects (e.g., spatial or relational reasoning). We demonstrate the effectiveness of our approach on the GQA dataset and its semantic scene graph image representation. We find that, despite GQA's compositionality and carefully balanced label distribution, two high-performing models drop 13-17% in accuracy compared to the original test set. Finally, we show that our automatic perturbation can be applied to the training set to mitigate the degradation in performance, opening the door to more robust models. △ Less

Submitted 17 March, 2021; originally announced March 2021.

Comments: Accepted to NAACL 2021

arXiv:2103.02143 [pdf, other]

Random Feature Attention

Authors: Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

Abstract: Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that us… ▽ More Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints. △ Less

Submitted 19 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

Comments: ICLR 2021

Showing 1–50 of 112 results for author: Schwartz, R