Search | arXiv e-print repository

Effective Controllable Bias Mitigation for Classification and Retrieval using Gate Adapters

Authors: Shahed Masoudian, Cornelia Volaucnik, Markus Schedl, Navid Rekabsaz

Abstract: Bias mitigation of Language Models has been the topic of many studies with a recent focus on learning separate modules like adapters for on-demand debiasing. Besides optimizing for a modularized debiased model, it is often critical in practice to control the degree of bias reduction at inference time, e.g., in order to tune for a desired performance-fairness trade-off in search results or to control the strength of debiasing in classification tasks. In this paper, we introduce Controllable Gate Adapter (ConGater), a novel modular gating mechanism with adjustable sensitivity parameters, which allows for a gradual transition from the biased state of the model to the fully debiased version at inference time. We demonstrate ConGater performance by (1) conducting adversarial debiasing experiments with three different models on three classification tasks with four protected attributes, and (2) reducing the bias of search results through fairness list-wise regularization to enable adjusting a trade-off between performance and fairness metrics. Our experiments on the classification tasks show that compared to baselines of the same caliber, ConGater can maintain higher task performance while containing less information regarding the attributes. Our results on the retrieval task show that the fully debiased ConGater can achieve the same fairness performance while maintaining task performance more than twice as high as that of recent strong baselines. Overall, besides strong performance, ConGater enables continuous transitioning between biased and debiased states of models, enhancing personalization of use and interpretability through controllability.

Submitted 19 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Paper is accepted to main proceedings of EACL 2024
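
The idea of a gate whose strength is adjustable at inference time can be illustrated with a minimal Python sketch. This is not the gate function from the paper: the linear interpolation between an identity gate and a sigmoid gate, the function name `apply_controllable_gate`, and the parameter `omega` are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_controllable_gate(
    hidden: Sequence[float],
    gate_logits: Sequence[float],
    omega: float,
) -> List[float]:
    """Scale each hidden unit by a gate whose strength is set by omega.

    omega = 0 leaves the hidden vector unchanged (original model state);
    omega = 1 applies the full sigmoid gate (hypothetical debiased state).
    The linear interpolation is an assumption, not the paper's formula.
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    if len(hidden) != len(gate_logits):
        raise ValueError("hidden and gate_logits must have equal length")
    gated = []
    for h, v in zip(hidden, gate_logits):
        gate = (1.0 - omega) + omega * sigmoid(v)
        gated.append(h * gate)
    return gated
```

Sweeping `omega` from 0 to 1 moves the output smoothly between the ungated and the fully gated representation, which is the kind of inference-time control the abstract describes.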

arXiv:2310.01217 [pdf, other]

ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale

Authors: Markus Frohmann, Carolin Holtermann, Shahed Masoudian, Anne Lauscher, Navid Rekabsaz

Abstract: Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning $n$ tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two-stage MTL introduces a substantial number of additional parameters. We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (~ $0.35$% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only $8$ transfer parameters per target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer.

Submitted 17 May, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: Accepted to Findings of the ACL: ACL 2024
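
As a rough illustration of what "linearly scaling the output representations of source adapters" could look like, the Python sketch below combines source-adapter outputs with one scalar per source task. The function name `scale_and_combine`, the use of a plain sum, and the one-scalar-per-source layout are assumptions for illustration; the abstract does not specify the exact parameterization.

```python
from typing import List, Sequence


def scale_and_combine(
    adapter_outputs: Sequence[Sequence[float]],
    scales: Sequence[float],
) -> List[float]:
    """Return the sum of source-adapter outputs, each multiplied by its scale.

    adapter_outputs[k] is the output vector of the k-th source adapter and
    scales[k] is the (learned, here hard-coded) scalar for that source task.
    """
    if len(adapter_outputs) != len(scales):
        raise ValueError("need exactly one scale per source adapter")
    if not adapter_outputs:
        raise ValueError("need at least one source adapter")
    dim = len(adapter_outputs[0])
    combined = [0.0] * dim
    for output, scale in zip(adapter_outputs, scales):
        if len(output) != dim:
            raise ValueError("all adapter outputs must share one dimension")
        for i, value in enumerate(output):
            combined[i] += scale * value
    return combined
```

With this layout the number of transfer parameters equals the number of source tasks, which is consistent in spirit with the very small parameter counts the abstract reports.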

arXiv:2308.10675 [pdf, ps, other]

A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays

Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

Abstract: We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback. In contrast to prior work, which required prior knowledge of the maximal delay $d_{\mathrm{max}}$ and had a linear dependence of the regret on it, our algorithm can tolerate arbitrary excessive delays up to order $T$ (where $T$ is the time horizon). The algorithm is based on three technical innovations, which may all be of independent interest: (1) We introduce the first implicit exploration scheme that works in the best-of-both-worlds setting. (2) We introduce the first control of distribution drift that does not rely on boundedness of delays. The control is based on the implicit exploration scheme and adaptive skipping of observations with excessive delays. (3) We introduce a procedure relating standard regret with drifted regret that does not rely on boundedness of delays. At the conceptual level, we demonstrate that the complexity of best-of-both-worlds bandits with delayed feedback is characterized by the amount of information missing at the time of decision making (measured by the number of outstanding observations) rather than the time that the information is missing (measured by the delays).

Submitted 27 May, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

arXiv:2306.08010 [pdf, other]

Domain Information Control at Inference Time for Acoustic Scene Classification

Authors: Shahed Masoudian, Khaled Koutini, Markus Schedl, Gerhard Widmer, Navid Rekabsaz

Abstract: Domain shift is considered a challenge in machine learning as it causes significant degradation of model performance. In the Acoustic Scene Classification (ASC) task, domain shift is mainly caused by different recording devices. Several studies have already targeted domain generalization to improve the performance of ASC models on unseen domains, such as new devices. Recently, the Controllable Gate Adapter (ConGater) has been proposed in Natural Language Processing to address the biased training data problem. ConGater allows controlling the debiasing process at inference time. ConGater's main advantage is the continuous and selective debiasing of a trained model during inference. In this work, we adapt ConGater to the audio spectrogram transformer for an acoustic scene classification task. We show that ConGater can be used to selectively adapt the learned representations to be invariant to domain shifts such as those caused by recording devices. Our analysis shows that ConGater can progressively remove device information from the learned representations and improve the model generalization, especially under domain shift conditions (e.g. unseen devices). We show that information removal can be extended to both device and location domain. Finally, we demonstrate ConGater's ability to enhance specific device performance without further training.

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2305.19036 [pdf, other]

Delayed Bandits: When Do Intermediate Observations Help?

Authors: Emmanuel Esposito, Saeed Masoudian, Hao Qiu, Dirk van der Hoeven, Nicolò Cesa-Bianchi, Yevgeny Seldin

Abstract: We study a $K$-armed bandit with delayed feedback and intermediate observations. We consider a model where intermediate observations have a form of a finite state, which is observed immediately after taking an action, whereas the loss is observed after an adversarially chosen delay. We show that the regime of the mapping of states to losses determines the complexity of the problem, irrespective of whether the mapping of actions to states is stochastic or adversarial. If the mapping of states to losses is adversarial, then the regret rate is of order $\sqrt{(K+d)T}$ (within log factors), where $T$ is the time horizon and $d$ is a fixed delay. This matches the regret rate of a $K$-armed bandit with delayed feedback and without intermediate observations, implying that intermediate observations are not helpful. However, if the mapping of states to losses is stochastic, we show that the regret grows at a rate of $\sqrt{\big(K+\min\{|\mathcal{S}|,d\}\big)T}$ (within log factors), implying that if the number $|\mathcal{S}|$ of states is smaller than the delay, then intermediate observations help. We also provide refined high-probability regret upper bounds for non-uniform delays, together with experimental validation of our algorithms.

Submitted 30 May, 2023; originally announced May 2023.
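
To make the comparison between the two regret rates in the abstract above concrete, the following Python sketch evaluates $\sqrt{(K+d)T}$ and $\sqrt{(K+\min\{|\mathcal{S}|,d\})T}$ for given values. Constants and log factors are ignored, so the numbers indicate only the order of growth; the function names are illustrative.

```python
import math


def adversarial_rate(num_arms: int, delay: int, horizon: int) -> float:
    """Order of the regret when states map to losses adversarially."""
    return math.sqrt((num_arms + delay) * horizon)


def stochastic_rate(
    num_arms: int, num_states: int, delay: int, horizon: int
) -> float:
    """Order of the regret when states map to losses stochastically."""
    return math.sqrt((num_arms + min(num_states, delay)) * horizon)


if __name__ == "__main__":
    K, d, T = 10, 100, 10_000
    for S in (5, 500):
        print(
            f"|S|={S}: adversarial ~ {adversarial_rate(K, d, T):.1f}, "
            f"stochastic ~ {stochastic_rate(K, S, d, T):.1f}"
        )
```

When the number of states is below the delay the stochastic rate is strictly smaller; once the number of states reaches or exceeds the delay the two expressions coincide, matching the abstract's statement about when intermediate observations help.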

arXiv:2211.13956 [pdf, other]

Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

Authors: Khaled Koutini, Shahed Masoudian, Florian Schmid, Hamid Eghbal-zadeh, Jan Schlüter, Gerhard Widmer

Abstract: The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features, and learning a representation that can generalize onto unseen tasks and datasets that are from the same domain. Hence, these models can be used as powerful feature extractors, in combination with shallower models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. During the past years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. However, recently attention-based transformer models have demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how the different setups in these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields in order to see how they affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we will show that transformers trained on Audioset can be extremely effective representation extractors for a wide range of downstream tasks.

Submitted 2 March, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: Will appear in HEAR: Holistic Evaluation of Audio Representations, Proceedings of Machine Learning Research (PMLR) 166. Source code: https://github.com/kkoutini/passt_hear21

Journal ref: Proceedings of Machine Learning Research v166 (2022) 65-89

arXiv:2206.14906 [pdf, ps, other]

A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

Abstract: We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}\log K}) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on an oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}\log K}) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.

Submitted 29 June, 2022; originally announced June 2022.

arXiv:2205.15171 [pdf, other]

Modular and On-demand Bias Mitigation with Attribute-Removal Subnetworks

Authors: Lukas Hauzenberger, Shahed Masoudian, Deepak Kumar, Markus Schedl, Navid Rekabsaz

Abstract: Societal biases are reflected in large pre-trained language models and their fine-tuned versions on downstream tasks. Common in-processing bias mitigation approaches, such as adversarial training and mutual information removal, introduce additional optimization criteria, and update the model to reach a new debiased state. However, in practice, end-users and practitioners might prefer to switch back to the original model, or apply debiasing only on a specific subset of protected attributes. To enable this, we propose a novel modular bias mitigation approach, consisting of stand-alone highly sparse debiasing subnetworks, where each debiasing module can be integrated into the core model on-demand at inference time. Our approach draws from the concept of diff pruning, and proposes a novel training regime adaptable to various representation disentanglement optimizations. We conduct experiments on three classification tasks with gender, race, and age as protected attributes. The results show that our modular approach, while maintaining task performance, improves (or at least remains on-par with) the effectiveness of bias mitigation in comparison with baseline finetuning. Particularly on a two-attribute dataset, our approach with separately learned debiasing subnetworks shows effective utilization of either or both the subnetworks for selective bias mitigation.

Submitted 4 June, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: Accepted in Findings of ACL 2023

arXiv:2103.12487 [pdf, ps, other]

Improved Analysis of the Tsallis-INF Algorithm in Stochastically Constrained Adversarial Bandits and Stochastic Bandits with Adversarial Corruptions

Authors: Saeed Masoudian, Yevgeny Seldin

Abstract: We derive improved regret bounds for the Tsallis-INF algorithm of Zimmert and Seldin (2021). We show that in adversarial regimes with a $(\Delta,C,T)$ self-bounding constraint the algorithm achieves a regret bound of $\mathcal{O}\left(\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{\left(\sum_{i\neq i^*} \frac{1}{\Delta_i}\right)^2}\right)+\sqrt{C\left(\sum_{i\neq i^*}\frac{1}{\Delta_i}\right)\log_+\left(\frac{(K-1)T}{C\sum_{i\neq i^*}\frac{1}{\Delta_i}}\right)}\right)$, where $T$ is the time horizon, $K$ is the number of arms, $\Delta_i$ are the suboptimality gaps, $i^*$ is the best arm, $C$ is the corruption magnitude, and $\log_+(x) = \max\left(1,\log x\right)$. The regime includes stochastic bandits, stochastically constrained adversarial bandits, and stochastic bandits with adversarial corruptions as special cases. Additionally, we provide a general analysis, which makes it possible to achieve the same kind of improvement for generalizations of Tsallis-INF to other settings beyond multiarmed bandits.

Submitted 13 September, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

Comments: Published Version in COLT 2021

Journal ref: Conference on Learning Theory 134 (2021) 3330-3350

arXiv:2011.11736 [pdf, other]

Accurate and Rapid Diagnosis of COVID-19 Pneumonia with Batch Effect Removal of Chest CT-Scans and Interpretable Artificial Intelligence

Authors: Rassa Ghavami Modegh, Mehrab Hamidi, Saeed Masoudian, Amir Mohseni, Hamzeh Lotfalinezhad, Mohammad Ali Kazemi, Behnaz Moradi, Mahyar Ghafoori, Omid Motamedi, Omid Pournik, Kiara Rezaei-Kalantari, Amirreza Manteghinezhad, Shaghayegh Haghjooy Javanmard, Fateme Abdoli Nezhad, Ahmad Enhesari, Mohammad Saeed Kheyrkhah, Razieh Eghtesadi, Javid Azadbakht, Akbar Aliasgharzadeh, Mohammad Reza Sharif, Ali Khaleghi, Abbas Foroutan, Hossein Ghanaati, Hamed Dashti, Hamid R. Rabiee

Abstract: COVID-19 is a virus with high transmission rate that demands rapid identification of the infected patients to reduce the spread of the disease. The current gold-standard test, Reverse-Transcription Polymerase Chain Reaction (RT-PCR), has a high rate of false negatives. Diagnosing from CT-scan images as a more accurate alternative has the challenge of distinguishing COVID-19 from other pneumonia diseases. Artificial intelligence can help radiologists and physicians to accelerate the process of diagnosis, increase its accuracy, and measure the severity of the disease. We designed a new interpretable deep neural network to distinguish healthy people, patients with COVID-19, and patients with other pneumonia diseases from axial lung CT-scan images. Our model also detects the infected areas and calculates the percentage of the infected lung volume. We first preprocessed the images to eliminate the batch effects of different devices, and then adopted a weakly supervised method to train the model without having any tags for the infected parts. We trained and evaluated the model on a large dataset of 3359 samples from 6 different medical centers. The model reached sensitivities of 97.75% and 98.15%, and specificities of 87% and 81.03% in separating healthy people from the diseased and COVID-19 from other diseases, respectively. It also demonstrated similar performance for 1435 samples from 6 different medical centers which proves its generalizability. The performance of the model on a large diverse dataset, its generalizability, and its interpretability make it suitable to be used as a reliable diagnostic system.

Submitted 8 January, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: 27 pages, 4 figures. Some minor changes have been applied to the text, some formulae have been added to make the descriptions clearer, and two names have been corrected (the full versions of the names are included)

arXiv:1906.00290 [pdf, other]

Adaptive Online Learning for Gradient-Based Optimizers

Authors: Saeed Masoudian, Ali Arabzadeh, Mahdi Jafari Siavoshani, Milad Jalal, Alireza Amouzad

Abstract: As application demands for online convex optimization accelerate, the need for designing new methods that simultaneously cover a large class of convex functions and impose the lowest possible regret is highly rising. Known online optimization methods usually perform well only in specific settings, and their performance depends highly on the geometry of the decision space and cost functions. However, in practice, lack of such geometric information leads to confusion in using the appropriate algorithm. To address this issue, some adaptive methods have been proposed that focus on adaptively learning parameters such as step size, Lipschitz constant, and strong convexity coefficient, or on specific parametric families such as quadratic regularizers. In this work, we generalize these methods and propose a framework that competes with the best algorithm in a family of expert algorithms. Our framework includes many of the well-known adaptive methods including MetaGrad, MetaGrad+C, and Ader. We also introduce a second algorithm that computationally outperforms our first algorithm with at most a constant factor increase in regret. Finally, as a representative application of our proposed algorithm, we study the problem of learning the best regularizer from a family of regularizers for Online Mirror Descent. Empirically, we support our theoretical findings in the problem of learning the best regularizer on the simplex and $l_2$-ball in a multiclass learning problem.

Submitted 1 June, 2019; originally announced June 2019.

arXiv:1511.05070 [pdf, ps, other]

Joining Transition Systems of Records: Some Congruency and Language-Theoretic Results

Authors: Mohammad Izadi, Saeed Masoudian, Sahand Mozaffari

Abstract: Büchi automaton of records (BAR) has been proposed as a basic operational semantics for the Reo coordination language. It is an extension of Büchi automaton by using a set of records as its alphabet or transition labels. Records are used to express the synchrony between the externally visible actions of coordinated components modeled by BARs. The main composition operator on the set of BARs is called join, which is the semantics of its counterpart in Reo. In this paper, we define the notion of labeled transition systems of records as a generalization of the notion of BAR, abstracting away from acceptance or rejection of strings. Then, we consider four equivalence relations (semantics) over the set of labeled transition systems of records and investigate their congruency with respect to the join composition operator. In fact, we prove that the finite-traces-based, infinite-traces-based, and nondeterministic finite automata (NFA)-based equivalence relations are all congruence relations over the set of all labeled transition systems of records with respect to the join operation. However, the equivalence relation using the Büchi acceptance condition is not. In addition, using these results, we introduce language-theoretic definitions of the join operation considering both finite and infinite strings. Also, we show that there is no language-based and structure-independent definition of the join operation on Büchi automata of records.

Submitted 16 November, 2015; originally announced November 2015.

Showing 1–12 of 12 results for author: Masoudian, S