-
Sum-of-norms regularized Nonnegative Matrix Factorization
Authors:
Andersen Ang,
Waqas Bin Hamed,
Hans De Sterck
Abstract:
When applying nonnegative matrix factorization (NMF), generally the rank parameter is unknown. Such rank in NMF, called the nonnegative rank, is usually estimated heuristically since computing the exact value of it is NP-hard. In this work, we propose an approximation method to estimate such rank while solving NMF on-the-fly. We use sum-of-norm (SON), a group-lasso structure that encourages pairwi…
▽ More
When applying nonnegative matrix factorization (NMF), generally the rank parameter is unknown. Such rank in NMF, called the nonnegative rank, is usually estimated heuristically since computing the exact value of it is NP-hard. In this work, we propose an approximation method to estimate such rank while solving NMF on-the-fly. We use sum-of-norm (SON), a group-lasso structure that encourages pairwise similarity, to reduce the rank of a factor matrix where the rank is overestimated at the beginning. On various datasets, SON-NMF is able to reveal the correct nonnegative rank of the data without any prior knowledge nor tuning.
SON-NMF is a nonconvx nonsmmoth non-separable non-proximable problem, solving it is nontrivial. First, as rank estimation in NMF is NP-hard, the proposed approach does not enjoy a lower computational complexity. Using a graph-theoretic argument, we prove that the complexity of the SON-NMF is almost irreducible. Second, the per-iteration cost of any algorithm solving SON-NMF is possibly high, which motivated us to propose a first-order BCD algorithm to approximately solve SON-NMF with a low per-iteration cost, in which we do so by the proximal average operator. Lastly, we propose a simple greedy method for post-processing.
SON-NMF exhibits favourable features for applications. Beside the ability to automatically estimate the rank from data, SON-NMF can deal with rank-deficient data matrix, can detect weak component with small energy. Furthermore, on the application of hyperspectral imaging, SON-NMF handle the issue of spectral variability naturally.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
First-order PDES for Graph Neural Networks: Advection And Burgers Equation Models
Authors:
Yifan Qu,
Oliver Krzysik,
Hans De Sterck,
Omer Ege Kara
Abstract:
Graph Neural Networks (GNNs) have established themselves as the preferred methodology in a multitude of domains, ranging from computer vision to computational biology, especially in contexts where data inherently conform to graph structures. While many existing methods have endeavored to model GNNs using various techniques, a prevalent challenge they grapple with is the issue of over-smoothing. Th…
▽ More
Graph Neural Networks (GNNs) have established themselves as the preferred methodology in a multitude of domains, ranging from computer vision to computational biology, especially in contexts where data inherently conform to graph structures. While many existing methods have endeavored to model GNNs using various techniques, a prevalent challenge they grapple with is the issue of over-smoothing. This paper presents new Graph Neural Network models that incorporate two first-order Partial Differential Equations (PDEs). These models do not increase complexity but effectively mitigate the over-smoothing problem. Our experimental findings highlight the capacity of our new PDE model to achieve comparable results with higher-order PDE models and fix the over-smoothing problem up to 64 layers. These results underscore the adaptability and versatility of GNNs, indicating that unconventional approaches can yield outcomes on par with established techniques.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences
Authors:
Yanming Kang,
Giang Tran,
Hans De Sterck
Abstract:
Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexit…
▽ More
Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $O(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}( \log n)$ levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$, depending on whether the queries are down-sampled or not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The Fast Multipole Attention mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
△ Less
Submitted 20 October, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Downlink Compression Improves TopK Sparsification
Authors:
William Zou,
Hans De Sterck,
Jun Liu
Abstract:
Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the lar…
▽ More
Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the largest K components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) direction, the currently accepted belief says that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not only does it reduce the amount of communication per step, but also, counter-intuitively, can improve the upper bound in the convergence analysis. To show this, we revisit non-convex convergence analysis of topK stochastic gradient descent (SGD) and extend it from the unidirectional to the bidirectional setting. We also remove a restriction of the previous analysis that requires unrealistically large values of K. We experimentally evaluate bidirectional topK SGD against unidirectional topK SGD and show that models trained with bidirectional topK SGD will perform as well as models trained with unidirectional topK SGD while yielding significant communication benefits for large numbers of workers.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Neural Lyapunov Control of Unknown Nonlinear Systems with Stability Guarantees
Authors:
Ruikun Zhou,
Thanin Quartz,
Hans De Sterck,
Jun Liu
Abstract:
Learning for control of dynamical systems with formal guarantees remains a challenging task. This paper proposes a learning framework to simultaneously stabilize an unknown nonlinear system with a neural controller and learn a neural Lyapunov function to certify a region of attraction (ROA) for the closed-loop system. The algorithmic structure consists of two neural networks and a satisfiability m…
▽ More
Learning for control of dynamical systems with formal guarantees remains a challenging task. This paper proposes a learning framework to simultaneously stabilize an unknown nonlinear system with a neural controller and learn a neural Lyapunov function to certify a region of attraction (ROA) for the closed-loop system. The algorithmic structure consists of two neural networks and a satisfiability modulo theories (SMT) solver. The first neural network is responsible for learning the unknown dynamics. The second neural network aims to identify a valid Lyapunov function and a provably stabilizing nonlinear controller. The SMT solver then verifies that the candidate Lyapunov function indeed satisfies the Lyapunov conditions. We provide theoretical guarantees of the proposed learning framework in terms of the closed-loop stability for the unknown nonlinear system. We illustrate the effectiveness of the approach with a set of numerical experiments.
△ Less
Submitted 15 October, 2022; v1 submitted 4 June, 2022;
originally announced June 2022.
-
Anderson Acceleration as a Krylov Method with Application to Asymptotic Convergence Analysis
Authors:
Hans De Sterck,
Yunhui He,
Oliver A. Krzysik
Abstract:
Anderson acceleration (AA) is widely used for accelerating the convergence of nonlinear fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in \mathbb{R}^n$, but little is known about how to quantify the convergence acceleration provided by AA. As a roadway towards gaining more understanding of convergence acceleration by AA, we study AA($m$), i.e., Anderson acceleration with finite window size $m$, app…
▽ More
Anderson acceleration (AA) is widely used for accelerating the convergence of nonlinear fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in \mathbb{R}^n$, but little is known about how to quantify the convergence acceleration provided by AA. As a roadway towards gaining more understanding of convergence acceleration by AA, we study AA($m$), i.e., Anderson acceleration with finite window size $m$, applied to the case of linear fixed-point iterations $x_{k+1}=M x_{k}+b$. We write AA($m$) as a Krylov method with polynomial residual update formulas, and derive recurrence relations for the AA($m$) polynomials. Writing AA($m$) as a Krylov method immediately implies that $k$ iterations of AA($m$) cannot produce a smaller residual than $k$ iterations of GMRES without restart (but without implying anything about the relative convergence speed of (windowed) AA($m$) versus restarted GMRES($m$)). We find that the AA($m$) residual polynomials observe a periodic memory effect where increasing powers of the error iteration matrix $M$ act on the initial residual as the iteration number increases. We derive several further results based on these polynomial residual update formulas, including orthogonality relations, a lower bound on the AA(1) acceleration coefficient $β_k$, and explicit nonlinear recursions for the AA(1) residuals and residual polynomials that do not include the acceleration coefficient $β_k$. Using these recurrence relations we also derive new residual convergence bounds for AA(1) in the linear case, demonstrating how the per-iteration residual reduction $||r_{k+1}||/||r_{k}||$ depends strongly on the residual reduction in the previous iteration and on the angle between the prior residual vectors $r_k$ and $r_{k-1}$. We apply these results to study the influence of the initial guess on the asymptotic convergence factor of AA(1), and to study AA(1) residual convergence patterns.
△ Less
Submitted 23 February, 2023; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Linear Asymptotic Convergence of Anderson Acceleration: Fixed-Point Analysis
Authors:
Hans De Sterck,
Yunhui He
Abstract:
We study the asymptotic convergence of AA($m$), i.e., Anderson acceleration with window size $m$ for accelerating fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in R^n$. Convergence acceleration by AA($m$) has been widely observed but is not well understood. We consider the case where the fixed-point iteration function $q(x)$ is differentiable and the convergence of the fixed-point method itself is…
▽ More
We study the asymptotic convergence of AA($m$), i.e., Anderson acceleration with window size $m$ for accelerating fixed-point methods $x_{k+1}=q(x_{k})$, $x_k \in R^n$. Convergence acceleration by AA($m$) has been widely observed but is not well understood. We consider the case where the fixed-point iteration function $q(x)$ is differentiable and the convergence of the fixed-point method itself is root-linear. We identify numerically several conspicuous properties of AA($m$) convergence: First, AA($m$) sequences $\{x_k\}$ converge root-linearly but the root-linear convergence factor depends strongly on the initial condition. Second, the AA($m$) acceleration coefficients $β^{(k)}$ do not converge but oscillate as $\{x_k\}$ converges to $x^*$. To shed light on these observations, we write the AA($m$) iteration as an augmented fixed-point iteration $z_{k+1} =Ψ(z_k)$, $z_k \in R^{n(m+1)}$ and analyze the continuity and differentiability properties of $Ψ(z)$ and $β(z)$. We find that the vector of acceleration coefficients $β(z)$ is not continuous at the fixed point $z^*$. However, we show that, despite the discontinuity of $β(z)$, the iteration function $Ψ(z)$ is Lipschitz continuous and directionally differentiable at $z^*$ for AA(1), and we generalize this to AA($m$) with $m>1$ for most cases. Furthermore, we find that $Ψ(z)$ is not differentiable at $z^*$. We then discuss how these theoretical findings relate to the observed convergence behaviour of AA($m$). The discontinuity of $β(z)$ at $z^*$ allows $β^{(k)}$ to oscillate as $\{x_k\}$ converges to $x^*$, and the non-differentiability of $Ψ(z)$ allows AA($m$) sequences to converge with root-linear convergence factors that strongly depend on the initial condition. Additional numerical results illustrate our findings.
△ Less
Submitted 2 May, 2022; v1 submitted 28 September, 2021;
originally announced September 2021.
-
N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations
Authors:
Aaron Baier-Reinio,
Hans De Sterck
Abstract:
We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that an input-dependent number of time steps is taken by the ordinary differential equation solver. Our goal in proposing the N-ODE Transformer is to investigate whether its depth-adaptivity may aid in overcoming some specific known theoretical limitations of the Transformer…
▽ More
We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that an input-dependent number of time steps is taken by the ordinary differential equation solver. Our goal in proposing the N-ODE Transformer is to investigate whether its depth-adaptivity may aid in overcoming some specific known theoretical limitations of the Transformer in handling nonlocal effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations that can only be overcome by using a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the N-ODE Transformer does not provide a remedy for the inherently nonlocal nature of the parity problem, and provide explanations for why this is so. Next, we pursue regularization of the N-ODE Transformer by penalizing the arclength of the ODE trajectories, but find that this fails to improve the accuracy or efficiency of the N-ODE Transformer on the challenging parity problem. We suggest future avenues of research for modifications and extensions of the N-ODE Transformer that may lead to improved accuracy and efficiency for sequence modelling tasks such as neural machine translation.
△ Less
Submitted 21 October, 2020;
originally announced October 2020.
-
On the Asymptotic Linear Convergence Speed of Anderson Acceleration Applied to ADMM
Authors:
Dawei Wang,
Yunhui He,
Hans De Sterck
Abstract:
Empirical results show that Anderson acceleration (AA) can be a powerful mechanism to improve the asymptotic linear convergence speed of the Alternating Direction Method of Multipliers (ADMM) when ADMM by itself converges linearly. However, theoretical results to quantify this improvement do not exist yet. In this paper we explain and quantify this improvement in linear asymptotic convergence spee…
▽ More
Empirical results show that Anderson acceleration (AA) can be a powerful mechanism to improve the asymptotic linear convergence speed of the Alternating Direction Method of Multipliers (ADMM) when ADMM by itself converges linearly. However, theoretical results to quantify this improvement do not exist yet. In this paper we explain and quantify this improvement in linear asymptotic convergence speed for the special case of a stationary version of AA applied to ADMM. We do so by considering the spectral properties of the Jacobians of ADMM and the stationary version of AA evaluated at the fixed point, where the coefficients of the stationary AA method are computed such that its asymptotic linear convergence factor is optimal. The optimal linear convergence factors of this stationary AA-ADMM method are computed analytically or by optimization, based on previous work on optimal stationary AA acceleration. Using this spectral picture and those analytical results, our approach provides new insight into how and by how much the stationary AA method can improve the asymptotic linear convergence factor of ADMM. Numerical results also indicate that the optimal linear convergence factor of the stationary AA methods gives a useful estimate for the asymptotic linear convergence speed of the non-stationary AA method that is used in practice.
△ Less
Submitted 29 November, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
On the Asymptotic Linear Convergence Speed of Anderson Acceleration, Nesterov Acceleration, and Nonlinear GMRES
Authors:
Hans De Sterck,
Yunhui He
Abstract:
We consider nonlinear convergence acceleration methods for fixed-point iteration $x_{k+1}=q(x_k)$, including Anderson acceleration (AA), nonlinear GMRES (NGMRES), and Nesterov-type acceleration (corresponding to AA with window size one). We focus on fixed-point methods that converge asymptotically linearly with convergence factor $ρ<1$ and that solve an underlying fully smooth and non-convex optim…
▽ More
We consider nonlinear convergence acceleration methods for fixed-point iteration $x_{k+1}=q(x_k)$, including Anderson acceleration (AA), nonlinear GMRES (NGMRES), and Nesterov-type acceleration (corresponding to AA with window size one). We focus on fixed-point methods that converge asymptotically linearly with convergence factor $ρ<1$ and that solve an underlying fully smooth and non-convex optimization problem. It is often observed that AA and NGMRES substantially improve the asymptotic convergence behavior of the fixed-point iteration, but this improvement has not been quantified theoretically. We investigate this problem under simplified conditions. First, we consider stationary versions of AA and NGMRES, and determine coefficients that result in optimal asymptotic convergence factors, given knowledge of the spectrum of $q'(x)$ at the fixed point $x^*$. This allows us to understand and quantify the asymptotic convergence improvement that can be provided by nonlinear convergence acceleration, viewing $x_{k+1}=q(x_k)$ as a nonlinear preconditioner for AA and NGMRES. Second, for the case of infinite window size, we consider linear asymptotic convergence bounds for GMRES applied to the fixed-point iteration linearized about $x^*$. Since AA and NGMRES are equivalent to GMRES in the linear case, one may expect the GMRES convergence factors to be relevant for AA and NGMRES as $x_k \rightarrow x^*$. Our results are illustrated numerically for a class of test problems from canonical tensor decomposition, comparing steepest descent and alternating least squares (ALS) as the fixed-point iterations that are accelerated by AA and NGMRES. Our numerical tests show that both approaches allow us to estimate asymptotic convergence speed for nonstationary AA and NGMRES with finite window size.
△ Less
Submitted 7 November, 2020; v1 submitted 3 July, 2020;
originally announced July 2020.
-
Nesterov Acceleration of Alternating Least Squares for Canonical Tensor Decomposition: Momentum Step Size Selection and Restart Mechanisms
Authors:
Drew Mitchell,
Nan Ye,
Hans De Sterck
Abstract:
We present Nesterov-type acceleration techniques for Alternating Least Squares (ALS) methods applied to canonical tensor decomposition. While Nesterov acceleration turns gradient descent into an optimal first-order method for convex problems by adding a momentum term with a specific weight sequence, a direct application of this method and weight sequence to ALS results in erratic convergence behav…
▽ More
We present Nesterov-type acceleration techniques for Alternating Least Squares (ALS) methods applied to canonical tensor decomposition. While Nesterov acceleration turns gradient descent into an optimal first-order method for convex problems by adding a momentum term with a specific weight sequence, a direct application of this method and weight sequence to ALS results in erratic convergence behaviour. This is so because the tensor decomposition problem is non-convex and ALS is accelerated instead of gradient descent. Instead, we consider various restart mechanisms and suitable choices of momentum weights that enable effective acceleration. Our extensive empirical results show that the Nesterov-accelerated ALS methods with restart can be dramatically more efficient than the stand-alone ALS or Nesterov accelerated gradient methods, when problems are ill-conditioned or accurate solutions are desired. The resulting methods perform competitively with or superior to existing acceleration methods for ALS, including ALS acceleration by NCG, NGMRES, or LBFGS, and additionally enjoy the benefit of being much easier to implement. We also compare with Nesterov-type updates where the momentum weight is determined by a line search, which are equivalent or closely related to existing line search methods for ALS. On a large and ill-conditioned 71$\times$1000$\times$900 tensor consisting of readings from chemical sensors to track hazardous gases, the restarted Nesterov-ALS method shows desirable robustness properties and outperforms any of the existing methods by a large factor. There is clear potential for extending our Nesterov-type acceleration approach to accelerating other optimization algorithms than ALS applied to other non-convex problems, such as Tucker tensor decomposition. Our Matlab code is available at https://github.com/hansdesterck/nonlinear-preconditioning-for-optimization.
△ Less
Submitted 30 November, 2019; v1 submitted 13 October, 2018;
originally announced October 2018.
-
Research and Education in Computational Science and Engineering
Authors:
Ulrich Rüde,
Karen Willcox,
Lois Curfman McInnes,
Hans De Sterck,
George Biros,
Hans Bungartz,
James Corones,
Evin Cramer,
James Crowley,
Omar Ghattas,
Max Gunzburger,
Michael Hanke,
Robert Harrison,
Michael Heroux,
Jan Hesthaven,
Peter Jimack,
Chris Johnson,
Kirk E. Jordan,
David E. Keyes,
Rolf Krause,
Vipin Kumar,
Stefan Mayer,
Juan Meza,
Knut Martin Mørken,
J. Tinsley Oden
, et al. (8 additional authors not shown)
Abstract:
Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that…
▽ More
Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.
△ Less
Submitted 31 December, 2017; v1 submitted 8 October, 2016;
originally announced October 2016.
-
The Statistical Mechanics of Human Weight Change
Authors:
John C. Lang,
Hans De Sterck,
Daniel M. Abrams
Abstract:
In the context of the global obesity epidemic, it is important to know who becomes obese and why. However, the processes that determine the changing shape of Body Mass Index (BMI) distributions in high-income societies are not well-understood. Here we establish the statistical mechanics of human weight change, providing a fundamental new understanding of human weight distributions. By compiling an…
▽ More
In the context of the global obesity epidemic, it is important to know who becomes obese and why. However, the processes that determine the changing shape of Body Mass Index (BMI) distributions in high-income societies are not well-understood. Here we establish the statistical mechanics of human weight change, providing a fundamental new understanding of human weight distributions. By compiling and analysing the largest data set so far of year-over-year BMI changes, we find, strikingly, that heavy people on average strongly decrease their weight year-over-year, and light people increase their weight. This drift towards the centre of the BMI distribution is balanced by diffusion resulting from random fluctuations in diet and physical activity that are, notably, proportional in size to BMI. We formulate a stochastic mathematical model for BMI dynamics, deriving a theoretical shape for the BMI distribution and offering a mechanism to explain the ongoing right-skewed broadening of BMI distributions over time. The model also provides new quantitative support for the hypothesis that peer-to-peer social influence plays a measurable role in BMI dynamics. More broadly, our results demonstrate a remarkable analogy with drift-diffusion mechanisms that are well-known from the physical sciences and finance.
△ Less
Submitted 29 September, 2016;
originally announced October 2016.
-
GeoTextTagger: High-Precision Location Tagging of Textual Documents using a Natural Language Processing Approach
Authors:
Shawn Brunsting,
Hans De Sterck,
Remco Dolman,
Teun van Sprundel
Abstract:
Location tagging, also known as geotagging or geolocation, is the process of assigning geographical coordinates to input data. In this paper we present an algorithm for location tagging of textual documents. Our approach makes use of previous work in natural language processing by using a state-of-the-art part-of-speech tagger and named entity recognizer to find blocks of text which may refer to l…
▽ More
Location tagging, also known as geotagging or geolocation, is the process of assigning geographical coordinates to input data. In this paper we present an algorithm for location tagging of textual documents. Our approach makes use of previous work in natural language processing by using a state-of-the-art part-of-speech tagger and named entity recognizer to find blocks of text which may refer to locations. A knowledge base (OpenStreatMap) is then used to find a list of possible locations for each block. Finally, one location is chosen for each block by assigning distance-based scores to each location and repeatedly selecting the location and block with the best score. We tested our geolocation algorithm with Wikipedia articles about topics with a well-defined geographical location that are geotagged by the articles' authors, where classification approaches have achieved median errors as low as 11 km, with attainable accuracy limited by the class size. Our approach achieved a 10th percentile error of 490 metres and median error of 54 kilometres on the Wikipedia dataset we used. When considering the five location tags with the greatest scores, 50% of articles were assigned at least one tag within 8.5 kilometres of the article's author-assigned true location. We also tested our approach on Twitter messages that are tagged with the location from which the message was sent. Twitter texts are challenging because they are short and unstructured and often do not contain words referring to the location they were sent from, but we obtain potentially useful results. We explain how we use the Spark framework for data analytics to collect and process our test data. In general, classification-based approaches for location tagging may be reaching their upper accuracy limit, but our precision-focused approach has high accuracy for some texts and shows significant potential for improvement overall.
△ Less
Submitted 22 January, 2016;
originally announced January 2016.
-
A polynomial expansion line search for large-scale unconstrained minimization of smooth L2-regularized loss functions, with implementation in Apache Spark
Authors:
Michael B Hynes,
Hans De Sterck
Abstract:
In large-scale unconstrained optimization algorithms such as limited memory BFGS (LBFGS), a common subproblem is a line search minimizing the loss function along a descent direction. Commonly used line searches iteratively find an approximate solution for which the Wolfe conditions are satisfied, typically requiring multiple function and gradient evaluations per line search, which is expensive in…
▽ More
In large-scale unconstrained optimization algorithms such as limited memory BFGS (LBFGS), a common subproblem is a line search minimizing the loss function along a descent direction. Commonly used line searches iteratively find an approximate solution for which the Wolfe conditions are satisfied, typically requiring multiple function and gradient evaluations per line search, which is expensive in parallel due to communication requirements. In this paper we propose a new line search approach for cases where the loss function is analytic, as in least squares regression, logistic regression, or low rank matrix factorization. We approximate the loss function by a truncated Taylor polynomial, whose coefficients may be computed efficiently in parallel with less communication than evaluating the gradient, after which this polynomial may be minimized with high accuracy in a neighbourhood of the expansion point. Our Polynomial Expansion Line Search (PELS) was implemented in the Apache Spark framework and used to accelerate the training of a logistic regression model on binary classification datasets from the LIBSVM repository with LBFGS and the Nonlinear Conjugate Gradient (NCG) method. In large-scale numerical experiments in parallel on a 16-node cluster with 256 cores using the URL, KDDA, and KDDB datasets, the PELS approach produced significant convergence improvements compared to the use of classical Wolfe line searches. For example, to reach the final training label prediction accuracies, LBFGS using PELS had speedup factors of 1.8--2 over LBFGS using a Wolfe line search, measured by both the number of iterations and the time required, due to the better accuracy of step sizes computed in the line search. PELS has the potential to significantly accelerate large-scale regression and factorization computations, and is applicable to continuous optimization problems with smooth loss functions.
△ Less
Submitted 26 January, 2016; v1 submitted 28 October, 2015;
originally announced October 2015.
-
Algorithmic Acceleration of Parallel ALS for Collaborative Filtering: Speeding up Distributed Big Data Recommendation in Spark
Authors:
Manda Winlaw,
Michael B. Hynes,
Anthony Caterini,
Hans De Sterck
Abstract:
Collaborative filtering algorithms are important building blocks in many practical recommendation systems. For example, many large-scale data processing environments include collaborative filtering models for which the Alternating Least Squares (ALS) algorithm is used to compute latent factor matrix decompositions. In this paper, we propose an approach to accelerate the convergence of parallel ALS…
▽ More
Collaborative filtering algorithms are important building blocks in many practical recommendation systems. For example, many large-scale data processing environments include collaborative filtering models for which the Alternating Least Squares (ALS) algorithm is used to compute latent factor matrix decompositions. In this paper, we propose an approach to accelerate the convergence of parallel ALS-based optimization methods for collaborative filtering using a nonlinear conjugate gradient (NCG) wrapper around the ALS iterations. We also provide a parallel implementation of the accelerated ALS-NCG algorithm in the Apache Spark distributed data processing environment, and an efficient line search technique as part of the ALS-NCG implementation that requires only one pass over the data on distributed datasets. In serial numerical experiments on a linux workstation and parallel numerical experiments on a 16 node cluster with 256 computing cores, we demonstrate that the combined ALS-NCG method requires many fewer iterations and less time than standalone ALS to reach movie rankings with high accuracy on the MovieLens 20M dataset. In parallel, ALS-NCG can achieve an acceleration factor of 4 or greater in clock time when an accurate solution is desired; furthermore, the acceleration factor increases as greater numerical precision is required in the solution. In addition, the NCG acceleration mechanism is efficient in parallel and scales linearly with problem size on synthetic datasets with up to nearly 1 billion ratings. The acceleration mechanism is general and may also be applicable to other optimization methods for collaborative filtering.
△ Less
Submitted 10 January, 2016; v1 submitted 12 August, 2015;
originally announced August 2015.
-
A Hierarchy of Linear Threshold Models for the Spread of Political Revolutions on Social Networks
Authors:
John C. Lang,
Hans De Sterck
Abstract:
We study a linear threshold agent-based model (ABM) for the spread of political revolutions on social networks using empirical network data. We propose new techniques for building a hierarchy of simplified ordinary differential equation (ODE) based models that aim to capture essential features of the ABM, including effects of the actual networks, and give insight in the parameter regime transition…
▽ More
We study a linear threshold agent-based model (ABM) for the spread of political revolutions on social networks using empirical network data. We propose new techniques for building a hierarchy of simplified ordinary differential equation (ODE) based models that aim to capture essential features of the ABM, including effects of the actual networks, and give insight in the parameter regime transitions of the ABM. We relate the ABM and the hierarchy of models to a population-level compartmental ODE model that we proposed previously for the spread of political revolutions [1], which is shown to be mathematically consistent with the proposed ABM and provides a way to analyze the global behaviour of the ABM. This consistency with the linear threshold ABM also provides further justification a posteriori for the compartmental model of [1]. Extending concepts from epidemiological modelling, we define a basic reproduction number $R_0$ for the linear threshold ABM and apply it to predict ABM behaviour on empirical networks. In small-scale numerical tests we investigate experimentally the differences in spreading behaviour that occur under the linear threshold ABM model when applied to some empirical online and offline social networks, searching for quantitative evidence that political revolutions may be facilitated by the modern online social networks of social media.
△ Less
Submitted 16 January, 2015;
originally announced January 2015.
-
The influence of societal individualism on a century of tobacco use: modelling the prevalence of smoking
Authors:
John C. Lang,
Daniel M. Abrams,
Hans De Sterck
Abstract:
Smoking of tobacco is predicted to cause approximately six million deaths worldwide in 2014. Responding effectively to this epidemic requires a thorough understanding of how smoking behaviour is transmitted and modified. Here, we present a new mathematical model of the social dynamics that cause cigarette smoking to spread in a population. Our model predicts that more individualistic societies wil…
▽ More
Smoking of tobacco is predicted to cause approximately six million deaths worldwide in 2014. Responding effectively to this epidemic requires a thorough understanding of how smoking behaviour is transmitted and modified. Here, we present a new mathematical model of the social dynamics that cause cigarette smoking to spread in a population. Our model predicts that more individualistic societies will show faster adoption and cessation of smoking. Evidence from a new century-long composite data set on smoking prevalence in 25 countries supports the model, with direct implications for public health interventions around the world. Our results suggest that differences in culture between societies can measurably affect the temporal dynamics of a social spreading process, and that these effects can be understood via a quantitative mathematical model matched to observations.
△ Less
Submitted 8 July, 2014;
originally announced July 2014.
-
The Arab Spring: A Simple Compartmental Model for the Dynamics of a Revolution
Authors:
John Lang,
Hans De Sterck
Abstract:
The self-immolation of Mohamed Bouazizi on December 17, 2011 in the small Tunisian city of Sidi Bouzid, set off a sequence of events culminating in the revolutions of the Arab Spring. It is widely believed that the Internet and social media played a critical role in the growth and success of protests that led to the downfall of the regimes in Egypt and Tunisia. However, the precise mechanisms by w…
▽ More
The self-immolation of Mohamed Bouazizi on December 17, 2011 in the small Tunisian city of Sidi Bouzid, set off a sequence of events culminating in the revolutions of the Arab Spring. It is widely believed that the Internet and social media played a critical role in the growth and success of protests that led to the downfall of the regimes in Egypt and Tunisia. However, the precise mechanisms by which these new media affected the course of events remain unclear. We introduce a simple compartmental model for the dynamics of a revolution in a dictatorial regime such as Tunisia or Egypt which takes into account the role of the Internet and social media. An elementary mathematical analysis of the model identifies four main parameter regions: stable police state, meta-stable police state, unstable police state, and failed state. We illustrate how these regions capture, at least qualitatively, a wide range of scenarios observed in the context of revolutionary movements by considering the revolutions in Tunisia and Egypt, as well as the situation in Iran, China, and Somalia, as case studies. We pose four questions about the dynamics of the Arab Spring revolutions and formulate answers informed by the model. We conclude with some possible directions for future work.
△ Less
Submitted 5 October, 2012;
originally announced October 2012.