-
Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus
Authors:
Jalil Nourmohammadi Khiarak,
Ammar Ahmadi,
Taher Ak-bari Saeed,
Meysam Asgari-Chenaghlu,
Toğrul Atabay,
Mohammad Reza Baghban Karimi,
Ismail Ceferli,
Farzad Hasanvand,
Seyed Mahboub Mousavi,
Morteza Noshad
Abstract:
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance…
▽ More
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance natural language processing (NLP) applications and language education technology. This corpus marks a significant step forward in the realm of linguistic resources, particularly for Turkic languages, which have lagged in the neural machine translation (NMT) revolution. By presenting the first comprehensive case study for the English-Azerbaijani (Arabic Script) language pair, this work underscores the transformative potential of NMT in low-resource contexts. The development and utilization of this corpus not only facilitate the advancement of machine translation systems tailored for specific linguistic needs but also promote inclusive language learning through technology. The findings demonstrate the corpus's effectiveness in training deep learning MT systems and underscore its role as an essential asset for researchers and educators aiming to foster bilingual education and multilingual communication. This research covers the way for future explorations into NMT applications for languages lacking substantial digital resources, thereby enhancing global language education frameworks. The Python package of our code is available at https://pypi.org/project/chevir-kartalol/, and we also have a website accessible at https://translate.kartalol.com/.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
STANLEY: Stochastic Gradient Anisotropic Langevin Dynamics for Learning Energy-Based Models
Authors:
Belhal Karimi,
Jianwen Xie,
** Li
Abstract:
We propose in this paper, STANLEY, a STochastic gradient ANisotropic LangEvin dYnamics, for sampling high dimensional data. With the growing efficacy and potential of Energy-Based modeling, also known as non-normalized probabilistic modeling, for modeling a generative process of different natures of high dimensional data observations, we present an end-to-end learning algorithm for Energy-Based mo…
▽ More
We propose in this paper, STANLEY, a STochastic gradient ANisotropic LangEvin dYnamics, for sampling high dimensional data. With the growing efficacy and potential of Energy-Based modeling, also known as non-normalized probabilistic modeling, for modeling a generative process of different natures of high dimensional data observations, we present an end-to-end learning algorithm for Energy-Based models (EBM) with the purpose of improving the quality of the resulting sampled data points. While the unknown normalizing constant of EBMs makes the training procedure intractable, resorting to Markov Chain Monte Carlo (MCMC) is in general a viable option. Realizing what MCMC entails for the EBM training, we propose in this paper, a novel high dimensional sampling method, based on an anisotropic stepsize and a gradient-informed covariance matrix, embedded into a discretized Langevin diffusion. We motivate the necessity for an anisotropic update of the negative samples in the Markov Chain by the nonlinearity of the backbone of the EBM, here a Convolutional Neural Network. Our resulting method, namely STANLEY, is an optimization algorithm for training Energy-Based models via our newly introduced MCMC method. We provide a theoretical understanding of our sampling scheme by proving that the sampler leads to a geometrically uniformly ergodic Markov Chain. Several image generation experiments are provided in our paper to show the effectiveness of our method.
△ Less
Submitted 19 October, 2023;
originally announced October 2023.
-
FeatureBox: Feature Engineering on GPUs for Massive-Scale Ads Systems
Authors:
Weijie Zhao,
Xuewu Jiao,
Xinsheng Luo,
**gxue Li,
Belhal Karimi,
** Li
Abstract:
Deep learning has been widely deployed for online ads systems to predict Click-Through Rate (CTR). Machine learning researchers and practitioners frequently retrain CTR models to test their new extracted features. However, the CTR model training often relies on a large number of raw input data logs. Hence, the feature extraction can take a significant proportion of the training time for an industr…
▽ More
Deep learning has been widely deployed for online ads systems to predict Click-Through Rate (CTR). Machine learning researchers and practitioners frequently retrain CTR models to test their new extracted features. However, the CTR model training often relies on a large number of raw input data logs. Hence, the feature extraction can take a significant proportion of the training time for an industrial-level CTR model. In this paper, we propose FeatureBox, a novel end-to-end training framework that pipelines the feature extraction and the training on GPU servers to save the intermediate I/O of the feature extraction. We rewrite computation-intensive feature extraction operators as GPU operators and leave the memory-intensive operator on CPUs. We introduce a layer-wise operator scheduling algorithm to schedule these heterogeneous operators. We present a light-weight GPU memory management algorithm that supports dynamic GPU memory allocation with minimal overhead. We experimentally evaluate FeatureBox and compare it with the previous in-production feature extraction framework on two real-world ads applications. The results confirm the effectiveness of our proposed method.
△ Less
Submitted 25 September, 2022;
originally announced October 2022.
-
Variational Flow Graphical Model
Authors:
Shaogang Ren,
Belhal Karimi,
Dingcheng Li,
** Li
Abstract:
This paper introduces a novel approach to embed flow-based models with hierarchical structures. The proposed framework is named Variational Flow Graphical (VFG) Model. VFGs learn the representation of high dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. By leveraging the expressive power of neural networks, VFGs produce a representat…
▽ More
This paper introduces a novel approach to embed flow-based models with hierarchical structures. The proposed framework is named Variational Flow Graphical (VFG) Model. VFGs learn the representation of high dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. By leveraging the expressive power of neural networks, VFGs produce a representation of the data using a lower dimension, thus overcoming the drawbacks of many flow-based models, usually requiring a high dimensional latent space involving many trivial variables. Aggregation nodes are introduced in the VFG models to integrate forward-backward hierarchical information via a message passing scheme. Maximizing the evidence lower bound (ELBO) of data likelihood aligns the forward and backward messages in each aggregation node achieving a consistency node state. Algorithms have been developed to learn model parameters through gradient updating regarding the ELBO objective.
The consistency of aggregation nodes enable VFGs to be applicable in tractable inference on graphical structures. Besides representation learning and numerical inference, VFGs provide a new approach for distribution modeling on datasets with graphical latent structures. Additionally, theoretical study shows that VFGs are universal approximators by leveraging the implicitly invertible flow-based structures. With flexible graphical structures and superior excessive power, VFGs could potentially be used to improve probabilistic inference. In the experiments, VFGs achieves improved evidence lower bound (ELBO) and likelihood values on multiple datasets.
△ Less
Submitted 6 July, 2022;
originally announced July 2022.
-
On Distributed Adaptive Optimization with Gradient Compression
Authors:
Xiaoyun Li,
Belhal Karimi,
** Li
Abstract:
We study COMP-AMS, a distributed optimization framework based on gradient averaging and adaptive AMSGrad algorithm. Gradient compression with error feedback is applied to reduce the communication cost in the gradient transmission process. Our convergence analysis of COMP-AMS shows that such compressed gradient averaging strategy yields same convergence rate as standard AMSGrad, and also exhibits t…
▽ More
We study COMP-AMS, a distributed optimization framework based on gradient averaging and adaptive AMSGrad algorithm. Gradient compression with error feedback is applied to reduce the communication cost in the gradient transmission process. Our convergence analysis of COMP-AMS shows that such compressed gradient averaging strategy yields same convergence rate as standard AMSGrad, and also exhibits the linear speedup effect w.r.t. the number of local workers. Compared with recently proposed protocols on distributed adaptive methods, COMP-AMS is simple and convenient. Numerical experiments are conducted to justify the theoretical findings, and demonstrate that the proposed method can achieve same test accuracy as the full-gradient AMSGrad with substantial communication savings. With its simplicity and efficiency, COMP-AMS can serve as a useful distributed training framework for adaptive gradient methods.
△ Less
Submitted 11 May, 2022;
originally announced May 2022.
-
Joint learning of object graph and relation graph for visual question answering
Authors:
Hao Li,
Xu Li,
Belhal Karimi,
Jie Chen,
Mingming Sun
Abstract:
Modeling visual question answering(VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which causes false attribute selection or missing relation in Figure 1(a). It is because these models cannot balance all kinds of information in scene graphs, neglecti…
▽ More
Modeling visual question answering(VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which causes false attribute selection or missing relation in Figure 1(a). It is because these models cannot balance all kinds of information in scene graphs, neglecting relation and attribute information. In this paper, we introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN), which can obtain a balanced representation by properly encoding multi-scale scene graph information. Specifically, we (i)transform the scene graph into two graphs with diversified focuses on objects and relations; Then we design a dual structure to encode them, which increases the weights from relations (ii)fuse the encoder output with attribute features, which increases the weights from attributes; (iii)propose a message-passing mechanism to enhance the information transfer between objects, relations and attributes. We conduct extensive experiments on datasets including GQA, VG, motif-VG and achieve new state of the art.
△ Less
Submitted 9 May, 2022;
originally announced May 2022.
-
A Class of Two-Timescale Stochastic EM Algorithms for Nonconvex Latent Variable Models
Authors:
Belhal Karimi,
** Li
Abstract:
The Expectation-Maximization (EM) algorithm is a popular choice for learning latent variable models. Variants of the EM have been initially introduced, using incremental updates to scale to large datasets, and using Monte Carlo (MC) approximations to bypass the intractable conditional expectation of the latent data for most nonconvex models. In this paper, we propose a general class of methods cal…
▽ More
The Expectation-Maximization (EM) algorithm is a popular choice for learning latent variable models. Variants of the EM have been initially introduced, using incremental updates to scale to large datasets, and using Monte Carlo (MC) approximations to bypass the intractable conditional expectation of the latent data for most nonconvex models. In this paper, we propose a general class of methods called Two-Timescale EM Methods based on a two-stage approach of stochastic updates to tackle an essential nonconvex optimization task for latent variable models. We motivate the choice of a double dynamic by invoking the variance reduction virtue of each stage of the method on both sources of noise: the index sampling for the incremental update and the MC approximation. We establish finite-time and global convergence bounds for nonconvex objective functions. Numerical applications on various models such as deformable template for image analysis or nonlinear models for pharmacokinetics are also presented to illustrate our findings.
△ Less
Submitted 18 March, 2022;
originally announced March 2022.
-
Layer-wise and Dimension-wise Locally Adaptive Federated Learning
Authors:
Belhal Karimi,
** Li,
Xiaoyun Li
Abstract:
In the emerging paradigm of Federated Learning (FL), large amount of clients such as mobile devices are used to train possibly high-dimensional models on their respective data. Combining (dimension-wise) adaptive gradient methods (e.g. Adam, AMSGrad) with FL has been an active direction, which is shown to outperform traditional SGD based FL in many cases. In this paper, we focus on the problem of…
▽ More
In the emerging paradigm of Federated Learning (FL), large amount of clients such as mobile devices are used to train possibly high-dimensional models on their respective data. Combining (dimension-wise) adaptive gradient methods (e.g. Adam, AMSGrad) with FL has been an active direction, which is shown to outperform traditional SGD based FL in many cases. In this paper, we focus on the problem of training federated deep neural networks, and propose a novel FL framework which further introduces layer-wise adaptivity to the local model updates. Our framework can be applied to locally adaptive FL methods including two recent algorithms, Mime and Fed-AMS. Theoretically, we provide a convergence analysis of our layer-wise FL methods, coined Fed-LAMB and Mime-LAMB, which matches the convergence rate of state-of-the-art results in FL and exhibits linear speedup in terms of the number of workers. Experimental results on various datasets and models, under both IID and non-IID local data settings, show that both Fed-LAMB and Mime-LAMB achieve faster convergence speed and better generalization performance, compared to the various recent adaptive FL methods.
△ Less
Submitted 23 June, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.
-
On the Convergence of Decentralized Adaptive Gradient Methods
Authors:
Xiangyi Chen,
Belhal Karimi,
Weijie Zhao,
** Li
Abstract:
Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication…
▽ More
Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. We illustrate the benefit of our generic decentralized framework on a prototype method, i.e., AMSGrad, both theoretically and numerically.
△ Less
Submitted 7 September, 2021;
originally announced September 2021.
-
FedSKETCH: Communication-Efficient and Private Federated Learning via Sketching
Authors:
Farzin Haddadpour,
Belhal Karimi,
** Li,
Xiaoyun Li
Abstract:
Communication complexity and privacy are the two key challenges in Federated Learning where the goal is to perform a distributed learning through a large volume of devices. In this work, we introduce FedSKETCH and FedSKETCHGATE algorithms to address both challenges in Federated learning jointly, where these algorithms are intended to be used for homogeneous and heterogeneous data distribution sett…
▽ More
Communication complexity and privacy are the two key challenges in Federated Learning where the goal is to perform a distributed learning through a large volume of devices. In this work, we introduce FedSKETCH and FedSKETCHGATE algorithms to address both challenges in Federated learning jointly, where these algorithms are intended to be used for homogeneous and heterogeneous data distribution settings respectively. The key idea is to compress the accumulation of local gradients using count sketch, therefore, the server does not have access to the gradients themselves which provides privacy. Furthermore, due to the lower dimension of sketching used, our method exhibits communication-efficiency property as well. We provide, for the aforementioned schemes, sharp convergence guarantees.
Finally, we back up our theory with various set of experiments.
△ Less
Submitted 11 August, 2020;
originally announced August 2020.
-
On the Global Convergence of (Fast) Incremental Expectation Maximization Methods
Authors:
Belhal Karimi,
Hoi-To Wai,
Eric Moulines,
Marc Lavielle
Abstract:
The EM algorithm is one of the most popular algorithm for inference in latent data models. The original formulation of the EM algorithm does not scale to large data set, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton have proposed an incremental version of the EM (iEM) in which at each iteration the conditional expectation of t…
▽ More
The EM algorithm is one of the most popular algorithm for inference in latent data models. The original formulation of the EM algorithm does not scale to large data set, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton have proposed an incremental version of the EM (iEM) in which at each iteration the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach has been proposed by Cappé and Moulines in which the E-step is replaced by a stochastic approximation step, closely related to stochastic gradient. In this paper, we analyze incremental and stochastic version of the EM algorithm as well as the variance reduced-version of Chen et. al. in a common unifying framework. We also introduce a new version incremental version, inspired by the SAGA algorithm by Defazio et. al. We establish non-asymptotic convergence bounds for global convergence. Numerical applications are presented in this article to illustrate our findings.
△ Less
Submitted 28 October, 2019;
originally announced October 2019.
-
An Optimistic Acceleration of AMSGrad for Nonconvex Optimization
Authors:
Jun-Kun Wang,
Xiaoyun Li,
Belhal Karimi,
** Li
Abstract:
We propose a new variant of AMSGrad, a popular adaptive gradient based optimization algorithm widely used for training deep neural networks. Our algorithm adds prior knowledge about the sequence of consecutive mini-batch gradients and leverages its underlying structure making the gradients sequentially predictable. By exploiting the predictability and ideas from optimistic online learning, the pro…
▽ More
We propose a new variant of AMSGrad, a popular adaptive gradient based optimization algorithm widely used for training deep neural networks. Our algorithm adds prior knowledge about the sequence of consecutive mini-batch gradients and leverages its underlying structure making the gradients sequentially predictable. By exploiting the predictability and ideas from optimistic online learning, the proposed algorithm can accelerate the convergence and increase sample efficiency. After establishing a tighter upper bound under some convexity conditions on the regret, we offer a complimentary view of our algorithm which generalizes the offline and stochastic version of nonconvex optimization. In the nonconvex case, we establish a non-asymptotic convergence bound independently of the initialization. We illustrate the practical speedup on several deep learning models via numerical experiments.
△ Less
Submitted 3 November, 2020; v1 submitted 4 March, 2019;
originally announced March 2019.
-
Construction of QC-LDPC Codes with Low Error Floor by Efficient Systematic Search and Elimination of Trap** Sets
Authors:
Bashirreza Karimi,
Amir. H Banihashemi
Abstract:
We propose a systematic design of protograph-based quasi-cyclic (QC) low-density parity-check (LDPC) codes with low error floor. We first characterize the trap** sets of such codes and demonstrate that the QC structure of the code eliminates some of the trap** set structures that can exist in a code with the same degree distribution and girth but lacking the QC structure. Using this characteri…
▽ More
We propose a systematic design of protograph-based quasi-cyclic (QC) low-density parity-check (LDPC) codes with low error floor. We first characterize the trap** sets of such codes and demonstrate that the QC structure of the code eliminates some of the trap** set structures that can exist in a code with the same degree distribution and girth but lacking the QC structure. Using this characterization, our design aims at eliminating a targeted collection of trap** sets. Considering the parent/child relationship between the trap** sets in the collection, we search for and eliminate those trap** sets that are in the collection but are not a child of any other trap** set in the collection. An efficient layered algorithm is designed for the search of these targeted trap** sets. Compared to the existing codes in the literature, the designed codes are superior in the sense that they are free of the same collection of trap** sets with a smaller block length, or a larger collection of trap** sets with the same block length. In addition, the efficiency of the search algorithm makes it possible to design codes with larger degrees which are free of trap** sets within larger ranges compared to the state-of-the-art.
△ Less
Submitted 17 June, 2019; v1 submitted 19 February, 2019;
originally announced February 2019.
-
Non-asymptotic Analysis of Biased Stochastic Approximation Scheme
Authors:
Belhal Karimi,
Blazej Miasojedow,
Eric Moulines,
Hoi-To Wai
Abstract:
Stochastic approximation (SA) is a key method used in statistical learning. Recently, its non-asymptotic convergence analysis has been considered in many papers. However, most of the prior analyses are made under restrictive assumptions such as unbiased gradient estimates and convex objective function, which significantly limit their applications to sophisticated tasks such as online and reinforce…
▽ More
Stochastic approximation (SA) is a key method used in statistical learning. Recently, its non-asymptotic convergence analysis has been considered in many papers. However, most of the prior analyses are made under restrictive assumptions such as unbiased gradient estimates and convex objective function, which significantly limit their applications to sophisticated tasks such as online and reinforcement learning. These restrictions are all essentially relaxed in this work. In particular, we analyze a general SA scheme to minimize a non-convex, smooth objective function. We consider update procedure whose drift term depends on a state-dependent Markov chain and the mean field is not necessarily of gradient type, covering approximate second-order method and allowing asymptotic bias for the one-step updates. We illustrate these settings with the online EM algorithm and the policy-gradient method for average reward maximization in reinforcement learning.
△ Less
Submitted 16 June, 2019; v1 submitted 1 February, 2019;
originally announced February 2019.
-
A Virtual Testbed for Critical Incident Investigation with Autonomous Remote Aerial Vehicle Surveying, Artificial Intelligence, and Decision Support
Authors:
David L. Smyth,
Sai Abinesh,
Nazli B. Karimi,
Brett Drury,
Ihsan Ullah,
Frank G. Glavin,
Michael G. Madden
Abstract:
Autonomous robotics and artificial intelligence techniques can be used to support human personnel in the event of critical incidents. These incidents can pose great danger to human life. Some examples of such assistance include: multi-robot surveying of the scene; collection of sensor data and scene imagery, real-time risk assessment and analysis; object identification and anomaly detection; and r…
▽ More
Autonomous robotics and artificial intelligence techniques can be used to support human personnel in the event of critical incidents. These incidents can pose great danger to human life. Some examples of such assistance include: multi-robot surveying of the scene; collection of sensor data and scene imagery, real-time risk assessment and analysis; object identification and anomaly detection; and retrieval of relevant supporting documentation such as standard operating procedures (SOPs). These incidents, although often rare, can involve chemical, biological, radiological/nuclear or explosive (CBRNE) substances and can be of high consequence. Real-world training and deployment of these systems can be costly and sometimes not feasible. For this reason, we have developed a realistic 3D model of a CBRNE scenario to act as a testbed for an initial set of assisting AI tools that we have developed.
△ Less
Submitted 25 January, 2019; v1 submitted 14 September, 2018;
originally announced September 2018.
-
A Virtual Environment with Multi-Robot Navigation, Analytics, and Decision Support for Critical Incident Investigation
Authors:
David L. Smyth,
James Fennell,
Sai Abinesh,
Nazli B. Karimi,
Frank G. Glavin,
Ihsan Ullah,
Brett Drury,
Michael G. Madden
Abstract:
Accidents and attacks that involve chemical, biological, radiological/nuclear or explosive (CBRNE) substances are rare, but can be of high consequence. Since the investigation of such events is not anybody's routine work, a range of AI techniques can reduce investigators' cognitive load and support decision-making, including: planning the assessment of the scene; ongoing evaluation and updating of…
▽ More
Accidents and attacks that involve chemical, biological, radiological/nuclear or explosive (CBRNE) substances are rare, but can be of high consequence. Since the investigation of such events is not anybody's routine work, a range of AI techniques can reduce investigators' cognitive load and support decision-making, including: planning the assessment of the scene; ongoing evaluation and updating of risks; control of autonomous vehicles for collecting images and sensor data; reviewing images/videos for items of interest; identification of anomalies; and retrieval of relevant documentation. Because of the rare and high-risk nature of these events, realistic simulations can support the development and evaluation of AI-based tools. We have developed realistic models of CBRNE scenarios and implemented an initial set of tools.
△ Less
Submitted 12 June, 2018;
originally announced June 2018.