Search | arXiv e-print repository

Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus

Authors: Jalil Nourmohammadi Khiarak, Ammar Ahmadi, Taher Akbari Saeed, Meysam Asgari-Chenaghlu, Toğrul Atabay, Mohammad Reza Baghban Karimi, Ismail Ceferli, Farzad Hasanvand, Seyed Mahboub Mousavi, Morteza Noshad

Abstract: This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance natural language processing (NLP) applications and language education technology. This corpus marks a significant step forward in the realm of linguistic resources, particularly for Turkic languages, which have lagged in the neural machine translation (NMT) revolution. By presenting the first comprehensive case study for the English-Azerbaijani (Arabic Script) language pair, this work underscores the transformative potential of NMT in low-resource contexts. The development and utilization of this corpus not only facilitate the advancement of machine translation systems tailored for specific linguistic needs but also promote inclusive language learning through technology. The findings demonstrate the corpus's effectiveness in training deep learning MT systems and underscore its role as an essential asset for researchers and educators aiming to foster bilingual education and multilingual communication. This research paves the way for future explorations into NMT applications for languages lacking substantial digital resources, thereby enhancing global language education frameworks.
The Python package of our code is available at https://pypi.org/project/chevir-kartalol/, and we also have a website accessible at https://translate.kartalol.com/.

Submitted 6 July, 2024; originally announced July 2024.

Comments: This paper is accepted and published at NeTTT 2024 Conf

arXiv:2402.09603 [pdf, other]

Scalable Graph Self-Supervised Learning

Authors: Ali Saheb Pasand, Reza Moravej, Mahdi Biparva, Raika Karimi, Ali Ghodsi

Abstract: In regularization Self-Supervised Learning (SSL) methods for graphs, computational complexity increases with the number of nodes in graphs and embedding dimensions. To mitigate the scalability limitations of non-contrastive graph SSL, we propose a novel approach to reduce the cost of computing the covariance matrix for the pre-training loss function with volume-maximization terms. Our work focuses on reducing the cost associated with the loss computation via graph node or dimension sampling. We provide theoretical insight into why dimension sampling would result in accurate loss computations and support it with mathematical derivation of the novel approach. We develop our experimental setup on the node-level graph prediction tasks, where SSL pre-training has been shown to be difficult due to the large size of real-world graphs. Our experiments demonstrate that the cost associated with the loss computation can be reduced via node or dimension sampling without lowering the downstream performance. Our results demonstrate that sampling mostly results in improved downstream performance. Ablation studies and experimental analysis are provided to untangle the role of the different factors in the experimental setup.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.05944 [pdf, other]

Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization

Authors: Mahdi Biparva, Raika Karimi, Faezeh Faez, Yingxue Zhang

Abstract: Temporal Graph Neural Networks have garnered substantial attention for their capacity to model evolving structural and temporal patterns while exhibiting impressive performance. However, it is known that these architectures are encumbered by issues that constrain their performance, such as over-squashing and over-smoothing. Meanwhile, Transformers have demonstrated exceptional computational capacity to effectively address challenges related to long-range dependencies. Consequently, we introduce Todyformer, a novel Transformer-based neural network tailored for dynamic graphs. It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers through i) a novel patchifying paradigm for dynamic graphs to improve over-squashing, ii) a structure-aware parametric tokenization strategy leveraging MPNNs, iii) a Transformer with temporal positional-encoding to capture long-range dependencies, and iv) an encoding architecture that alternates between local and global contextualization, mitigating over-smoothing in MPNNs. Experimental evaluations on public benchmark datasets demonstrate that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks. Furthermore, we illustrate the underlying aspects of the proposed model in effectively capturing extensive temporal dependencies in dynamic graphs.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2311.16706 [pdf, ps, other]

Sinkhorn Flow: A Continuous-Time Framework for Understanding and Generalizing the Sinkhorn Algorithm

Authors: Mohammad Reza Karimi, Ya-Ping Hsieh, Andreas Krause

Abstract: Many problems in machine learning can be formulated as solving entropy-regularized optimal transport on the space of probability measures. The canonical approach involves the Sinkhorn iterates, renowned for their rich mathematical properties. Recently, the Sinkhorn algorithm has been recast within the mirror descent framework, thus benefiting from classical optimization theory insights. Here, we build upon this result by introducing a continuous-time analogue of the Sinkhorn algorithm. This perspective allows us to derive novel variants of Sinkhorn schemes that are robust to noise and bias. Moreover, our continuous-time dynamics not only generalize but also offer a unified perspective on several recently discovered dynamics in machine learning and mathematics, such as the "Wasserstein mirror flow" of (Deb et al. 2023) or the "mean-field Schrödinger equation" of (Claisse et al. 2023).
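For readers unfamiliar with the Sinkhorn iterates mentioned in this abstract, the following is a minimal sketch of the classic discrete-time Sinkhorn algorithm for entropy-regularized optimal transport between two finite distributions. It is not the continuous-time flow the paper introduces; the function name `sinkhorn`, the regularization strength `eps`, and the toy cost matrix are illustrative assumptions.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Classic discrete Sinkhorn iterations (illustrative sketch).

    cost: n x m cost matrix as a list of lists.
    a, b: source and target marginals, each summing to 1.
    Returns the n x m transport plan as a list of lists.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that the plan's row sums match a
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the plan's column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Plan P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem (assumed for illustration): two points on each side
cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.3, 0.7]
b = [0.6, 0.4]
plan = sinkhorn(cost, a, b)
```

Each iteration alternately rescales rows and columns of the Gibbs kernel, so at convergence the plan's marginals agree with `a` and `b`.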

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.02374 [pdf, other]

Riemannian stochastic optimization methods avoid strict saddle points

Authors: Ya-Ping Hsieh, Mohammad Reza Karimi, Andreas Krause, Panayotis Mertikopoulos

Abstract: Many modern machine learning applications - from online principal component analysis to covariance matrix identification and dictionary learning - can be formulated as minimization problems on Riemannian manifolds, and are typically solved with a Riemannian stochastic gradient method (or some variant thereof). However, in many cases of interest, the resulting minimization problem is not geodesically convex, so the convergence of the chosen solver to a desirable solution - i.e., a local minimizer - is by no means guaranteed. In this paper, we study precisely this question, that is, whether stochastic Riemannian optimization algorithms are guaranteed to avoid saddle points with probability 1. For generality, we study a family of retraction-based methods which, in addition to having a potentially much lower per-iteration cost relative to Riemannian gradient descent, include other widely used algorithms, such as natural policy gradient methods and mirror descent in ordinary convex spaces. In this general setting, we show that, under mild assumptions for the ambient manifold and the oracle providing gradient information, the policies under study avoid strict saddle points / submanifolds with probability 1, from any initial condition. This result provides an important sanity check for the use of gradient methods on manifolds as it shows that, almost always, the limit state of a stochastic Riemannian algorithm can only be a local minimizer.

Submitted 4 November, 2023; originally announced November 2023.

Comments: 27 pages, 3 figures

MSC Class: Primary 62L20; 37N40; secondary 90C15; 90C48

arXiv:2310.02862 [pdf, other]

A novel asymmetrical autoencoder with a sparsifying discrete cosine Stockwell transform layer for gearbox sensor data compression

Authors: Xin Zhu, Daoguang Yang, Hongyi Pan, Hamid Reza Karimi, Didem Ozevin, Ahmet Enis Cetin

Abstract: The lack of an efficient compression model remains a challenge for the wireless transmission of gearbox data in non-contact gear fault diagnosis problems. In this paper, we present a signal-adaptive asymmetrical autoencoder with a transform domain layer to compress sensor signals. First, a new discrete cosine Stockwell transform (DCST) layer is introduced to replace linear layers in a multi-layer autoencoder. A trainable filter is implemented in the DCST domain by utilizing the multiplication property of the convolution. A trainable hard-thresholding layer is applied to reduce redundant data in the DCST layer to make the feature map sparse. In comparison to the linear layer, the DCST layer reduces the number of trainable parameters and improves the accuracy of data reconstruction. Second, training the autoencoder with a sparsifying DCST layer only requires a small number of datasets. The proposed method is superior to other autoencoder-based methods on the University of Connecticut (UoC) and Southeast University (SEU) gearbox datasets, as the average quality score is improved by 2.00% at the lowest and 32.35% at the highest with a limited number of training samples.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2212.06375 [pdf, ps, other]

doi 10.1016/j.nuclphysa.2023.122684

Hybrid stars within the framework of the Sigma-Omega-Rho model combined with the MIT and NJL models

Authors: Reza Karimi, H. R. Moshfegh

Abstract: In this paper, we investigate the structure of hybrid stars consisting of hadrons (neutrons, protons, sigmas, lambdas), leptons (electrons, muons), and quarks (up, down, strange). We use a relativistic mean-field (RMF) model, namely the Sigma-omega-rho model, for the hadronic phase and the MIT bag model as well as the NJL model for the quark phase. In addition, Maxwell and Gibbs conditions are employed to investigate the hadron-quark phase transition. Finally, by obtaining the mass-radius relation, $M \leqslant 2.07\, M_{\odot}$ is predicted for such hybrid stars.

Submitted 13 December, 2022; originally announced December 2022.

Comments: 23 pages, 10 figures

arXiv:2211.01689 [pdf, other]

Isotropic Gaussian Processes on Finite Spaces of Graphs

Authors: Viacheslav Borovitskiy, Mohammad Reza Karimi, Vignesh Ram Somnath, Andreas Krause

Abstract: We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing the notions of closeness and symmetries, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to the Euclidean isotropic processes, like squared exponential or Matérn. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. We prove a hardness result, showing that in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.

Submitted 25 February, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

arXiv:2210.13867 [pdf, ps, other]

A Dynamical System View of Langevin-Based Non-Convex Sampling

Authors: Mohammad Reza Karimi, Ya-Ping Hsieh, Andreas Krause

Abstract: Non-convex sampling is a key challenge in machine learning, central to non-convex optimization in deep learning as well as to approximate probabilistic inference. Despite its significance, theoretically there remain many important challenges: Existing guarantees (1) typically only hold for the averaged iterates rather than the more desirable last iterates, (2) lack convergence metrics that capture the scales of the variables such as Wasserstein distances, and (3) mainly apply to elementary schemes such as stochastic gradient Langevin dynamics. In this paper, we develop a new framework that lifts the above issues by harnessing several tools from the theory of dynamical systems. Our key result is that, for a large class of state-of-the-art sampling schemes, their last-iterate convergence in Wasserstein distances can be reduced to the study of their continuous-time counterparts, which is much better understood. Coupled with standard assumptions of MCMC sampling, our theory immediately yields the last-iterate Wasserstein convergence of many advanced sampling schemes such as proximal, randomized mid-point, and Runge-Kutta integrators. Beyond existing methods, our framework also motivates more efficient schemes that enjoy the same rigorous guarantees.

Submitted 13 March, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: typos corrected, references added

MSC Class: 62D05

arXiv:2206.06795 [pdf, other]

Riemannian stochastic approximation algorithms

Authors: Mohammad Reza Karimi, Ya-Ping Hsieh, Panayotis Mertikopoulos, Andreas Krause

Abstract: We examine a wide class of stochastic approximation algorithms for solving (stochastic) nonlinear problems on Riemannian manifolds. Such algorithms arise naturally in the study of Riemannian optimization, game theory and optimal transport, but their behavior is much less understood compared to the Euclidean case because of the lack of a global linear structure on the manifold. We overcome this difficulty by introducing a suitable Fermi coordinate frame which allows us to map the asymptotic behavior of the Riemannian Robbins-Monro (RRM) algorithms under study to that of an associated deterministic dynamical system. In so doing, we provide a general template of almost sure convergence results that mirrors and extends the existing theory for Euclidean Robbins-Monro schemes, despite the significant complications that arise due to the curvature and topology of the underlying manifold. We showcase the flexibility of the proposed framework by applying it to a range of retraction-based variants of the popular optimistic / extra-gradient methods for solving minimization problems and games, and we provide a unified treatment for their convergence.

Submitted 27 December, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: 33 pages, 2 figures; a one-page abstract of this paper was presented in COLT 2022

MSC Class: Primary 62L20; 37N40; secondary 90C15; 90C47; 90C48

arXiv:2204.01172 [pdf, other]

PERFECT: Prompt-free and Efficient Few-shot Learning with Language Models

Authors: Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Marzieh Saeidi, Lambert Mathias, Veselin Stoyanov, Majid Yazdani

Abstract: Current methods for few-shot fine-tuning of pretrained masked language models (PLMs) require carefully engineered prompts and verbalizers for each new task to convert examples into a cloze-format that the PLM can score. In this work, we propose PERFECT, a simple and efficient method for few-shot fine-tuning of PLMs without relying on any such handcrafting, which is highly effective given as few as 32 data points. PERFECT makes two key design choices: First, we show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning and reduce memory and storage costs by roughly factors of 5 and 100, respectively. Second, instead of using handcrafted verbalizers, we learn new multi-token label embeddings during fine-tuning, which are not tied to the model vocabulary and which allow us to avoid complex auto-regressive decoding. These embeddings are not only learnable from limited data but also enable nearly 100x faster training and inference. Experiments on a wide range of few-shot NLP tasks demonstrate that PERFECT, while being simple and efficient, also outperforms existing state-of-the-art few-shot learning methods. Our code is publicly available at https://github.com/facebookresearch/perfect.git.

Submitted 25 April, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

Comments: ACL, 2022

arXiv:2204.00855 [pdf, ps, other]

doi 10.1088/1674-4527/ac6417

The First Photometric Study of AH Mic Contact Binary System

Authors: Atila Poro, Mark G. Blackford, Selda Ranjbar Salehian, Esfandiar Jahangiri, Meysam Samiei Dastjerdi, Mohammadjavad Gozarandi, Reihaneh Karimi, Tabassom Madayen, Elnaz Bakhshi, Farnad Hedayati

Abstract: The first multi-color light curve analysis of the AH Mic binary system is presented. This system has very few past observations from the southern hemisphere. We extracted the minima times from the light curves based on the Markov Chain Monte Carlo (MCMC) approach and obtained a new ephemeris. To provide modern photometric light curve solutions, we used the Physics of Eclipsing Binaries (Phoebe) software package and the MCMC approach. Light curve solutions yielded a system temperature ratio of 0.950, and we assumed a cold star-spot for the hotter star based on the O'Connell effect. This analysis reveals that AH Mic is a W-subtype W UMa contact system with a fill-out factor of 21.3% and a mass ratio of 2.32. The absolute physical parameters of the components are estimated by using the Gaia Early Data Release 3 (EDR3) parallax method to be M_h(M_Sun)=0.702(26), M_c(M_Sun)=1.629(104), R_h(R_Sun)=0.852(21), R_c(R_Sun)=1.240(28), L_h(L_Sun)=0.618(3) and L_c(L_Sun)=1.067(7). The orbital angular momentum of the AH Mic binary system was found to be 51.866(35). The components' positions of this system are plotted in the Hertzsprung-Russell (H-R) diagram.

Submitted 5 April, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

Comments: 6 figures, 4 tables, accepted by the Research in Astronomy and Astrophysics (RAA) journal

arXiv:2201.00283 [pdf, other]

DF-SSmVEP: Dual Frequency Aggregated Steady-State Motion Visual Evoked Potential Design with Bifold Canonical Correlation Analysis

Authors: Raika Karimi, Arash Mohammadi, Amir Asif, Habib Benali

Abstract: Recent advancements in Electroencephalography (EEG) sensor technologies and signal processing algorithms have paved the way for further evolution of Brain Computer Interfaces (BCI). When it comes to Signal Processing (SP) for BCI, there has been a surge of interest in Steady-State motion-Visual Evoked Potentials (SSmVEP), where motion stimulation is utilized to address key issues associated with conventional light-flashing/flickering. Such benefits, however, come at the price of lower accuracy and a lower Information Transfer Rate (ITR). In this regard, the paper focuses on the design of a novel SSmVEP paradigm without using resources such as trial time, phase, and/or number of targets to enhance the ITR. The proposed design is based on the intuitively pleasing idea of integrating more than one motion within a single SSmVEP target stimulus, simultaneously. To elicit SSmVEP, we designed a novel and innovative dual frequency aggregated modulation paradigm, referred to as the Dual Frequency Aggregated steady-state motion Visual Evoked Potential (DF-SSmVEP), by concurrently integrating "Radial Zoom" and "Rotation" motions in a single target without increasing the trial length. Compared to conventional SSmVEPs, the proposed DF-SSmVEP framework consists of two motion modes integrated and shown simultaneously, each modulated by a specific target frequency. The paper also develops a specific unsupervised classification model, referred to as the Bifold Canonical Correlation Analysis (BCCA), based on two motion frequencies per target.
The proposed DF-SSmVEP is evaluated based on a real EEG dataset and the results corroborate its superiority. The proposed DF-SSmVEP outperforms its counterparts and achieved an average ITR of 30.7 +/- 1.97 and an average accuracy of 92.5 +/- 2.04.

Submitted 1 January, 2022; originally announced January 2022.

arXiv:2107.05151 [pdf, other]

Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF

Authors: H. J. Meijer, J. Truong, R. Karimi

Abstract: Over the last few years, neural network derived word embeddings became popular in the natural language processing literature. Studies conducted have mostly focused on the quality and application of word embeddings trained on publicly available corpora such as Wikipedia or other news and social media sources. However, these studies are limited to generic text and thus lack technical and scientific nuances such as domain specific vocabulary, abbreviations, or scientific formulas which are commonly used in academic context. This research focuses on the performance of word embeddings applied to a large scale academic corpus. More specifically, we compare quality and efficiency of trained word embeddings to TFIDF representations in modeling content of scientific articles. We use a word2vec skip-gram model trained on titles and abstracts of about 70 million scientific articles. Furthermore, we have developed a benchmark to evaluate content models in a scientific context. The benchmark is based on a categorization task that matches articles to journals for about 1.3 million articles published in 2017. Our results show that content models based on word embeddings are better for titles (short text) while TFIDF works better for abstracts (longer text). However, the slight improvement of TFIDF for larger text comes at the expense of 3.7 times more memory requirement as well as up to 184 times higher computation times which may make it inefficient for online applications.
In addition, we have created a 2-dimensional visualization of the journals modeled via embeddings to qualitatively inspect the embedding model. This graph shows useful insights and can be used to find competitive journals or gaps to propose new journals.

Submitted 11 July, 2021; originally announced July 2021.

arXiv:2106.04647 [pdf, other]

Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Authors: Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder

Abstract: Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning on SuperGLUE and low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.
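The weight construction described in this abstract, a sum of Kronecker products between shared "slow" matrices and per-layer "fast" rank-one matrices, can be illustrated with a small sketch. This is only a reading of the abstract's one-sentence description, not the authors' implementation; the function names, matrix shapes, and toy values below are assumptions chosen for illustration.

```python
def outer(s, t):
    """Rank-one matrix s t^T from two vectors."""
    return [[si * tj for tj in t] for si in s]


def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def kronecker_sum_weight(slow, fast_s, fast_t):
    """W = sum_i kron(A_i, s_i t_i^T).

    slow:   list of shared "slow" matrices A_i (hypothetical shapes)
    fast_s: list of per-layer vectors s_i
    fast_t: list of per-layer vectors t_i
    """
    total = None
    for A, s, t in zip(slow, fast_s, fast_t):
        term = kron(A, outer(s, t))
        if total is None:
            total = term
        else:
            total = [[x + y for x, y in zip(r1, r2)]
                     for r1, r2 in zip(total, term)]
    return total


# Toy values (assumed): two 2x2 slow matrices, rank-one factors of size 2 and 3
slow = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
fast_s = [[1, 2], [0, 1]]
fast_t = [[1, 1, 1], [1, 0, 0]]
W = kronecker_sum_weight(slow, fast_s, fast_t)  # shape (2*2) x (2*3)
```

In this toy setting the full 4 x 6 matrix is built from 8 slow entries plus 10 fast entries; the saving grows with matrix size, which is the intuition behind the small trainable-parameter fraction the abstract reports.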

Submitted 27 November, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: accepted in NeurIPS, 2021

arXiv:2106.04489 [pdf, other]

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

Authors: Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson

Abstract: State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing information across tasks. In this paper, we show that we can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks, which condition on task, adapter position, and layer id in a transformer model. This parameter-efficient multi-task learning framework allows us to achieve the best of both worlds by sharing knowledge across tasks via hypernetworks while enabling the model to adapt to each individual task through task-specific adapters. Experiments on the well-known GLUE benchmark show improved performance in multi-task learning while adding only 0.29% parameters per task. We additionally demonstrate substantial performance improvements in few-shot domain generalization across a variety of tasks. Our code is publicly available at https://github.com/rabeehk/hyperformer.

Submitted 8 June, 2021; originally announced June 2021.

Comments: accepted in ACL, 2021

arXiv:2010.09818 [pdf, other]

Online Active Model Selection for Pre-trained Classifiers

Authors: Mohammad Reza Karimi, Nezihe Merve Gürel, Bojan Karlaš, Johannes Rausch, Ce Zhang, Andreas Krause

Abstract: Given $k$ pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams. We establish several theoretical guarantees for our algorithm and extensively demonstrate its effectiveness in our experimental studies.

Submitted 17 April, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

arXiv:2006.02464 [pdf, other]

Serving DNNs like Clockwork: Performance Predictability from the Bottom Up

Authors: Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace

Abstract: Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable - on the contrary we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.

Submitted 26 October, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20)

arXiv:2003.03304 [pdf, other]

doi 10.1109/IPDPS47924.2020.00063

Bandwidth-Aware Page Placement in NUMA

Authors: David Gureya, João Neto, Reza Karimi, João Barreto, Pramod Bhatotia, Vivien Quema, Rodrigo Rodrigues, Paolo Romano, Vladimir Vlassov

Abstract: Page placement is a critical problem for memory-intensive applications running on a shared-memory multiprocessor with a non-uniform memory access (NUMA) architecture. State-of-the-art page placement mechanisms interleave pages evenly across NUMA nodes. However, this approach fails to maximize memory throughput in modern NUMA systems, characterised by asymmetric bandwidths and latencies, and sensitive to memory contention and interconnect congestion phenomena. We propose BWAP, a novel page placement mechanism based on asymmetric weighted page interleaving. BWAP combines an analytical performance model of the target NUMA system with on-line iterative tuning of page distribution for a given memory-intensive application. Our experimental evaluation with representative memory-intensive workloads shows that BWAP performs up to 66% better than state-of-the-art techniques. These gains are particularly relevant when multiple co-located applications run in disjoint partitions of a large NUMA machine or when applications do not scale up to the total number of cores.

Submitted 19 May, 2023; v1 submitted 6 March, 2020; originally announced March 2020.

Journal ref: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, 2020 pp. 546-556

arXiv:2001.04198 [pdf, ps, other]

Predefined-time Terminal Sliding Mode Control of Robot Manipulators

Authors: Chang-Duo Liang, Ming-Feng Ge, Zhi-Wei Liu, Yan-Wu Wang, Hamid Reza Karimi

Abstract: In this paper, we present a new terminal sliding mode control to achieve predefined-time stability of robot manipulators. The proposed control is developed based on a novel predefined-time terminal sliding mode (PTSM) surface, on which the states are forced to reach the origin in a predefined time, i.e., the settling time is independent of the initial condition and can be explicitly user-defined by adjusting specific parameters called the predefined-time parameters. It is also demonstrated that the proposed control can provide satisfactory steady-state performance in the presence of both external disturbances and parametric uncertainties. Besides, we present a formal systemic analysis method to derive the sufficient conditions for guaranteeing the predefined-time convergence of the closed-loop system. Finally, the effectiveness and performance of the presented control scheme are illustrated through both theoretical comparisons and numerical simulations.

Submitted 25 April, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

Comments: 10 pages, 9 figures, This draft is not intended for publication

arXiv:1909.06321 [pdf, other]

End-to-End Bias Mitigation by Modelling Biases in Corpora

Authors: Rabeeh Karimi Mahabadi, Yonatan Belinkov, James Henderson

Abstract: Several recent studies have shown that strong natural language understanding (NLU) models are prone to relying on unwanted dataset biases without learning the underlying task, resulting in models that fail to generalize to out-of-domain datasets and are likely to perform poorly in real-world scenarios. We propose two learning strategies to train neural models, which are more robust to such biases and transfer better to out-of-domain datasets. The biases are specified in terms of one or more bias-only models, which learn to leverage the dataset biases. During training, the bias-only models' predictions are used to adjust the loss of the base model to reduce its reliance on biases by down-weighting the biased examples and focusing the training on the hard examples. We experiment on large-scale natural language inference and fact verification benchmarks, evaluating on out-of-domain datasets that are specifically designed to assess the robustness of models against known biases in the training data. Results show that our debiasing methods greatly improve robustness in all settings and better transfer to other textual entailment datasets. Our code and data are publicly available at https://github.com/rabeehk/robust-nli.

Submitted 23 April, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: Accepted in ACL 2020 as a long paper
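
The abstract above says that a bias-only model's predictions adjust the base model's loss so that biased examples are down-weighted. The abstract does not name its two strategies, so the sketch below assumes one common way of doing this, a product-of-experts style combination of the two models' log-probabilities; the function names and the toy logits are hypothetical and are not taken from the authors' code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_loss(base_logits, bias_logits, label):
    """Cross-entropy on the combined (base + bias-only) log-probabilities.

    Assumption: the two models are combined by adding log-probabilities
    and renormalising, as in a product of experts.
    """
    combined = [
        b + h
        for b, h in zip(log_softmax(base_logits), log_softmax(bias_logits))
    ]
    return -log_softmax(combined)[label]


# An uninformative bias-only model leaves the usual cross-entropy unchanged.
plain = poe_loss([0.0, 0.0], [0.0, 0.0], label=0)

# When the bias-only model already predicts the label confidently, the
# combined loss is small, so this example contributes little to training.
biased = poe_loss([0.0, 0.0], [5.0, 0.0], label=0)

print(plain, biased)
```

Under this assumed combination, examples the bias-only model gets right with high confidence yield a small loss, while examples it gets wrong keep a large loss, which matches the abstract's description of focusing training on hard examples.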

arXiv:1711.01566 [pdf, other]

Stochastic Submodular Maximization: The Case of Coverage Functions

Authors: Mohammad Reza Karimi, Mario Lucic, Hamed Hassani, Andreas Krause

Abstract: Stochastic optimization of continuous objectives is at the heart of modern machine learning. However, many important problems are of discrete nature and often involve submodular objectives. We seek to unleash the power of stochastic continuous optimization, namely stochastic gradient descent and its variants, to such discrete problems. We first introduce the problem of stochastic submodular optimization, where one needs to optimize a submodular objective which is given as an expectation. Our model captures situations where the discrete objective arises as an empirical risk (e.g., in the case of exemplar-based clustering), or is given as an explicit stochastic model (e.g., in the case of influence maximization in social networks). By exploiting that common extensions act linearly on the class of submodular functions, we employ projected stochastic gradient ascent and its variants in the continuous domain, and perform rounding to obtain discrete solutions. We focus on the rich and widely used family of weighted coverage functions. We show that our approach yields solutions that are guaranteed to match the optimal approximation guarantees, while reducing the computational cost by several orders of magnitude, as we demonstrate empirically.

Submitted 5 November, 2017; originally announced November 2017.

Comments: 31st Conference on Neural Information Processing Systems (NIPS 2017)

arXiv:1705.07400 [pdf, other]

MITHRIL: Mining Sporadic Associations for Cache Prefetching

Authors: Juncheng Yang, Reza Karimi, Trausti Sæmundsson, Avani Wildani, Ymir Vigfusson

Abstract: The growing pressure on cloud application scalability has accentuated storage performance as a critical bottleneck. Although cache replacement algorithms have been extensively studied, cache prefetching, which reduces latency by retrieving items before they are actually requested, remains an underexplored area. Existing approaches to history-based prefetching, in particular, provide too few benefits for real systems for the resources they cost. We propose MITHRIL, a prefetching layer that efficiently exploits historical patterns in cache request associations. MITHRIL is inspired by sporadic association rule mining and only relies on the timestamps of requests. Through evaluation of 135 block-storage traces, we show that MITHRIL is effective, giving an average 55% hit ratio increase over LRU and PROBABILITY GRAPH and a 36% hit ratio gain over AMP, at reasonable cost. We further show that MITHRIL can supplement any cache replacement algorithm and be readily integrated into existing systems. Furthermore, we demonstrate that the improvement comes from MITHRIL being able to capture mid-frequency blocks.

Submitted 21 May, 2017; originally announced May 2017.

arXiv:1608.01391 [pdf]

Language free character recognition using character sketch and center of gravity shifting

Authors: Masoud Nosrati, Fakhereh Rahimi, Ronak Karimi

Abstract: In this research, we present a heuristic method for character recognition. For this purpose, a sketch is constructed from the image that contains the character to be recognized. This sketch contains the most important pixels of the image, which are representative of the original image. These points are the most probable points in pixel-by-pixel matching of the image against the target image. Furthermore, a technique called gravity shifting is used to overcome the problem of elongation of characters. Combining the sketch and gravity techniques led to a language-free character recognition method. This method can be implemented independently for real-time uses or in combination with other classifiers as a feature extraction algorithm. Low complexity and acceptable performance are the most notable features of this method, which allow it to be implemented simply on mobile and battery-limited computing devices. Results show that in the best case an accuracy of 86% is obtained and in the worst case 28% of recognized characters are accurate.

Submitted 3 August, 2016; originally announced August 2016.

Comments: World Applied Programming, Vol (6), Issue (2), July 2016

arXiv:1606.08789 [pdf]

doi 10.1088/0953-4075/49/21/215602

Ultrafast molecular dynamics of dissociative ionization in OCS probed by soft X-ray synchrotron radiation

Authors: Ali Ramadhan, Benji Wales, Isabelle Gauthier, Reza Karimi, Michael MacDonald, Lucia Zuin, Joe Sanderson

Abstract: Soft X-rays (90-173 eV) from the 3rd generation Canadian Light Source have been used in conjunction with a multi coincidence time and position sensitive detection apparatus to observe the dissociative ionization of OCS. By varying the X-ray energy we can compare dynamics from direct and Auger ionization processes, and access ionization channels which result in two or three body breakup, from 2+ to 4+ ionization states. We make several new observations for the 3+ state such as kinetic energy release limited by photon energy, and using Dalitz plots we can see evidence of timescale effects between the direct and Auger ionization process for the first time. Finally, using Dalitz plots for OCS$^{4+}$ we observe for the first time that breakup involving an O$^{2+}$ ion can only proceed from out of equilibrium nuclear arrangement for S(2p) Auger ionization.

Submitted 15 August, 2016; v1 submitted 28 June, 2016; originally announced June 2016.

Comments: 24 pages, 8 figures, 1 table, 77 references

Journal ref: J. Phys. B: At. Mol. Opt. Phys. 49 215602 (2016)

arXiv:1605.06855 [pdf, other]

Smart broadcasting: Do you want to be seen?

Authors: Mohammad Reza Karimi, Erfan Tavakoli, Mehrdad Farajtabar, Le Song, Manuel Gomez-Rodriguez

Abstract: Many users in online social networks are constantly trying to gain attention from their followers by broadcasting posts to them. These broadcasters are likely to gain greater attention if their posts can remain visible for a longer period of time among their followers' most recent feeds. Then when to post? In this paper, we study the problem of smart broadcasting using the framework of temporal point processes, where we model users' feeds and posts as discrete events occurring in continuous time. Based on such a continuous-time model, choosing a broadcasting strategy for a user becomes a problem of designing the conditional intensity of her posting events. We derive a novel formula which links this conditional intensity with the visibility of the user in her followers' feeds. Furthermore, by exploiting this formula, we develop an efficient convex optimization framework for the when-to-post problem. Our method can find broadcasting strategies that reach a desired visibility level with provable guarantees. We experimented with data gathered from Twitter, and show that our framework can consistently make broadcasters' posts more visible than alternatives.

Submitted 22 May, 2016; originally announced May 2016.

Comments: To appear in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco (CA, USA), 2016

arXiv:1605.01301 [pdf]

doi 10.1007/978-3-319-21404-7_26

Latency Optimization for Resource Allocation in Cloud Computing System

Authors: Masoud Nosrati, Abdolah Chalechale, Ronak Karimi

Abstract: Recent studies in different fields of science have created a need for high performance computing systems like Cloud. A critical issue in the design and implementation of such systems is resource allocation, which is directly affected by internal and external factors like the number of nodes, geographical distance and communication latencies. Many optimizations have been made to resource allocation methods in order to achieve better performance by concentrating on computing, network and energy resources. Communication latencies, as a limitation of network resources, have always played an important role in parallel processing (especially in fine-grained programs). In this paper, we survey the resource allocation issue in Cloud and then optimize a common resource allocation method based on communication latencies. To this end, we added a table to the Resource Agent (the entity that allocates resources to the applicants) to hold the history of previous allocations. Then, a probability matrix was constructed for allocation of resources partially based on the history of latencies. Response time was considered as a metric for evaluation of the proposed method. Results indicated better response times, especially as the number of tasks increases. Besides, the proposed method is inherently capable of detecting unavailable resources by measuring the communication latencies. This helps with other issues in cloud systems like migration, resource replication and fault tolerance.

Submitted 4 May, 2016; originally announced May 2016.

Comments: 12 pages, 5 figures, In proceeding of ICCSA 2015, published by Springer LNCS

arXiv:1111.6539 [pdf]

Secure Geographic Routing Protocols: Issues and Approaches

Authors: Mehdi Sookhak, Ramin Karimi, Mahboobeh Haghparast, Ismail Fauzi Isnin

Abstract: In recent years, routing protocols in wireless sensor networks (WSN) have been substantially investigated by researchers. Most state-of-the-art surveys have focused on reviewing wireless sensor networks. In this paper we review the existing secure geographic routing protocols for wireless sensor networks (WSN) and also provide a qualitative comparison of them.

Submitted 28 November, 2011; originally announced November 2011.

Comments: 8 pages

Journal ref: International Journal of Computer Science Issues 8(4): 382-389 (2011)

Showing 1–28 of 28 results for author: Karimi, R