PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Authors: Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li

Abstract: It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.

Submitted 12 September, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

arXiv:2205.11467 [pdf, other]

A Question-Answer Driven Approach to Reveal Affirmative Interpretations from Verbal Negations

Authors: Md Mosharaf Hossain, Luke Holman, Anusha Kakileti, Tiffany Iris Kao, Nathan Raul Brito, Aaron Abraham Mathews, Eduardo Blanco

Abstract: This paper explores a question-answer driven approach to reveal affirmative interpretations from verbal negations (i.e., when a negation cue grammatically modifies a verb). We create a new corpus consisting of 4,472 verbal negations and discover that 67.1% of them convey that an event actually occurred. Annotators generate and answer 7,277 questions for the 3,001 negations that convey an affirmative interpretation. We first cast the problem of revealing affirmative interpretations from negations as a natural language inference (NLI) classification task. Experimental results show that state-of-the-art transformers trained with existing NLI corpora are insufficient to reveal affirmative interpretations. We also observe, however, that fine-tuning brings small improvements. In addition to NLI classification, we also explore the more realistic task of generating affirmative interpretations directly from negations with the T5 transformer. We conclude that the generation task remains a challenge as T5 substantially underperforms humans.

Submitted 23 May, 2022; originally announced May 2022.

Comments: Accepted at the Findings of NAACL 2022

arXiv:2205.07838 [pdf, other]

Physics-informed machine learning techniques for edge plasma turbulence modelling in computational theory and experiment

Authors: Abhilash Mathews

Abstract: Edge plasma turbulence is critical to the performance of magnetic confinement fusion devices. Towards better understanding edge turbulence in both theory and experiment, a custom-built physics-informed deep learning framework constrained by partial differential equations is developed to accurately learn turbulent fields consistent with the two-fluid theory from partial observations of electron pressure. This calculation is not otherwise possible using conventional equilibrium models. With this technique, the first direct quantitative comparisons of turbulent fields between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling are demonstrated with good overall agreement found in magnetized helical plasmas at low normalized pressure. To translate these computational techniques to experimental fusion plasmas, a novel method to translate brightness measurements of HeI line radiation into local plasma fluctuations is demonstrated via a newly created deep learning framework that integrates neutral transport physics and collisional radiative theory for the $3^3 D - 2^3 P$ transition in atomic helium. Using fast camera data on the Alcator C-Mod tokamak, this thesis presents the first 2-dimensional time-dependent experimental measurements of the turbulent electron density, electron temperature, and neutral density in a fusion plasma using a single spectral line. With this experimentally inferred data, initial estimates of the 2-dimensional turbulent electric field consistent with drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field are calculated. The inclusion of atomic helium effects on particle and energy sources are found to strengthen correlations between the electric field and electron pressure while broadening turbulent field amplitudes which impact ${\bf E \times B}$ flows and shearing rates.

Submitted 16 May, 2022; originally announced May 2022.

Comments: PhD thesis, 172 pages, 38 figures, 4 tables

arXiv:2205.04235 [pdf, other]

Measuring Cognitive Workload Using Multimodal Sensors

Authors: Niraj Hirachan, Anita Mathews, Julio Romero, Raul Fernandez Rojas

Abstract: This study aims to identify a set of indicators to estimate cognitive workload using a multimodal sensing approach and machine learning. A set of three cognitive tests was conducted to induce cognitive workload in twelve participants at two levels of task difficulty (Easy and Hard). Four sensors were used to measure the participants' physiological change: electrocardiogram (ECG), electrodermal activity (EDA), respiration (RESP), and blood oxygen saturation (SpO2). To understand the perceived cognitive workload, NASA-TLX was used after each test and analysed using the Chi-Square test. Three well-known classifiers (LDA, SVM, and DT) were trained and tested independently using the physiological data. The statistical analysis showed that participants' perceived cognitive workload was significantly different (p<0.001) between the tests, which demonstrated the validity of the experimental conditions to induce different cognitive levels. Classification results showed that a fusion of ECG and EDA presented good discriminating power (acc=0.74) for cognitive workload detection. This study provides preliminary results in the identification of a possible set of indicators of cognitive workload. Future work needs to be carried out to validate the indicators using more realistic scenarios and with a larger population.

Submitted 5 May, 2022; originally announced May 2022.

arXiv:2204.11689 [pdf, other]

doi 10.1103/PhysRevLett.129.235002

Deep electric field predictions by drift-reduced Braginskii theory with plasma-neutral interactions based upon experimental images of boundary turbulence

Authors: Abhilash Mathews, Jerry Hughes, James Terry, Seung-Gyou Baek

Abstract: We present 2-dimensional turbulent electric field calculations via physics-informed deep learning consistent with (i) drift-reduced Braginskii theory under the framework of an axisymmetric fusion plasma with purely toroidal field and (ii) experimental estimates of the fluctuating electron density and temperature on open field lines obtained from analysis of gas puff imaging of a discharge on the Alcator C-Mod tokamak. The inclusion of effects from the locally puffed atomic helium on particle and energy sources within the reduced plasma turbulence model is found to strengthen correlations between the electric field and electron pressure. The neutrals are also directly associated with broadening the distribution of turbulent field amplitudes and increasing ${\bf E \times B}$ shearing rates. This demonstrates a novel approach in plasma experiments by solving for nonlinear dynamics consistent with partial differential equations and data without encoding explicit boundary or initial conditions.

Submitted 28 November, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

Comments: 6 pages, 3 figures, 2 tables

arXiv:2112.03256 [pdf, ps, other]

Impact of Target Word and Context on End-to-End Metonymy Detection

Authors: Kevin Alex Mathews, Michael Strube

Abstract: Metonymy is a figure of speech in which an entity is referred to by another related entity. The task of metonymy detection aims to distinguish metonymic tokens from literal ones. Until now, metonymy detection methods have attempted to disambiguate only a single noun phrase in a sentence, typically location names or organization names. In this paper, we disambiguate every word in a sentence by reformulating metonymy detection as a sequence labeling task. We also investigate the impact of target word and context on metonymy detection. We show that the target word is less useful for detecting metonymy in our dataset. On the other hand, the entity types that are associated with domain-specific words in their context are easier to solve. This shows that the context words are much more relevant for detecting metonymy.

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.13802 [pdf, other]

Factorized Fourier Neural Operators

Authors: Alasdair Tran, Alexander Mathews, Lexing Xie, Cheng Soon Ong

Abstract: We propose the Factorized Fourier Neural Operator (F-FNO), a learning-based approach for simulating partial differential equations (PDEs). Starting from a recently proposed Fourier representation of flow fields, the F-FNO bridges the performance gap between pure machine learning approaches and the best numerical or hybrid solvers. This is achieved with new representations - separable spectral layers and improved residual connections - and a combination of training strategies such as the Markov assumption, Gaussian noise, and cosine learning rate decay. On several challenging benchmark PDEs on regular grids, structured meshes, and point clouds, the F-FNO can scale to deeper networks and outperform both the FNO and the geo-FNO, reducing the error by 83% on the Navier-Stokes problem, 31% on the elasticity problem, 57% on the airfoil flow problem, and 60% on the plastic forging problem. Compared to the state-of-the-art pseudo-spectral method, the F-FNO can take a step size that is an order of magnitude larger in time and achieve an order of magnitude speedup to produce the same solution quality.

Submitted 2 March, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

Comments: Published in The Eleventh International Conference on Learning Representations (2023). Code is available at https://github.com/alasdairtran/fourierflow

arXiv:2107.09744 [pdf, other]

doi 10.1063/5.0066064

Turbulent field fluctuations in gyrokinetic and fluid plasmas

Authors: Abhilash Mathews, Noah Mandell, Manaure Francisquez, Jerry Hughes, Ammar Hakim

Abstract: A key uncertainty in the design and development of magnetic confinement fusion energy reactors is predicting edge plasma turbulence. An essential step in overcoming this uncertainty is the validation in accuracy of reduced turbulent transport models. Drift-reduced Braginskii two-fluid theory is one such set of reduced equations that has for decades simulated boundary plasmas in experiment, but significant questions exist regarding its predictive ability. To this end, using a novel physics-informed deep learning framework, we demonstrate the first ever direct quantitative comparisons of turbulent field fluctuations between electrostatic two-fluid theory and electromagnetic gyrokinetic modelling with good overall agreement found in magnetized helical plasmas at low normalized pressure. This framework is readily adaptable to experimental and astrophysical environments, and presents a new technique for the numerical validation and discovery of reduced global plasma turbulence models.

Submitted 6 October, 2021; v1 submitted 20 July, 2021; originally announced July 2021.

Comments: 13 pages, 5 figures

arXiv:2107.04140 [pdf, other]

First-Generation Inference Accelerator Deployment at Facebook

Authors: Michael Anderson, Benny Chen, Stephen Chen, Summer Deng, Jordan Fix, Michael Gschwind, Aravind Kalaiah, Changkyu Kim, Jaewon Lee, Jason Liang, Haixin Liu, Yinghai Lu, Jack Montgomery, Arun Moorthy, Satish Nadathur, Sam Naghshineh, Avinash Nayak, Jongsoo Park, Chris Petersen, Martin Schatz, Narayanan Sundaram, Bangsheng Tang, Peter Tang, Amy Yang, Jiecao Yu, et al. (90 additional authors not shown)

Abstract: In this paper, we provide a deep dive into the deployment of inference accelerators at Facebook. Many of our ML workloads have unique characteristics, such as sparse memory accesses, large model sizes, as well as high compute, memory and network bandwidth requirements. We co-designed a high-performance, energy-efficient inference accelerator platform based on these requirements. We describe the inference accelerator platform ecosystem we developed and deployed at Facebook: both hardware, through Open Compute Platform (OCP), and software framework and tooling, through PyTorch/Caffe2/Glow. A characteristic of this ecosystem from the start is its openness to enable a variety of AI accelerators from different vendors. This platform, with six low-power accelerator cards alongside a single-socket host CPU, allows us to serve models of high complexity that cannot be easily or efficiently run on CPUs. We describe various performance optimizations, at both platform and accelerator level, which enable this platform to serve production traffic at Facebook. We also share deployment challenges, lessons learned during performance optimization, as well as provide guidance for future inference hardware co-design.

Submitted 4 August, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

arXiv:2104.05158 [pdf, other]

Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Authors: Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie Amy Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, et al. (28 additional authors not shown)

Abstract: Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row, column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.

Submitted 26 February, 2023; v1 submitted 11 April, 2021; originally announced April 2021.

arXiv:2102.07289 [pdf, other]

doi 10.1145/3442381.3449945

Radflow: A Recurrent, Aggregated, and Decomposable Model for Networks of Time Series

Authors: Alasdair Tran, Alexander Mathews, Cheng Soon Ong, Lexing Xie

Abstract: We propose a new model for networks of time series that influence each other. Graph structures among time series are found in diverse domains, such as web traffic influenced by hyperlinks, product sales influenced by recommendation, or urban transport volume influenced by road networks and weather. There has been recent progress in graph modeling and in time series forecasting, respectively, but an expressive and scalable approach for a network of series does not yet exist. We introduce Radflow, a novel model that embodies three key ideas: a recurrent neural network to obtain node embeddings that depend on time, the aggregation of the flow of influence from neighboring nodes with multi-head attention, and the multi-layer decomposition of time series. Radflow naturally takes into account dynamic networks where nodes and edges change over time, and it can be used for prediction and data imputation tasks. On real-world datasets ranging from a few hundred to a few hundred thousand nodes, we observe that Radflow variants are the best performing model across a wide range of settings. The recurrent component in Radflow also outperforms N-BEATS, the state-of-the-art time series model. We show that Radflow can learn different trends and seasonal patterns, that it is robust to missing nodes and edges, and that correlated temporal patterns among network neighbors reflect influence strength. We curate WikiTraffic, the largest dynamic network of time series with 366K nodes and 22M time-dependent links spanning five years. This dataset provides an open benchmark for developing models in this area, with applications that include optimizing resources for the web. More broadly, Radflow has the potential to improve forecasts in correlated time series networks such as the stock market, and impute missing measurements in geographically dispersed networks of natural phenomena.

Submitted 14 February, 2021; originally announced February 2021.

Comments: Published in The Web Conference 2021. Code is available at https://github.com/alasdairtran/radflow

Journal ref: Proceedings of The Web Conference 2021 (WWW '21)

arXiv:2102.01974 [pdf, other]

doi 10.1145/3437963.3441703

AttentionFlow: Visualising Influence in Networks of Time Series

Authors: Minjeong Shin, Alasdair Tran, Siqi Wu, Alexander Mathews, Rong Wang, Georgiana Lyall, Lexing Xie

Abstract: The collective attention on online items such as web pages, search terms, and videos reflects trends that are of social, cultural, and economic interest. Moreover, attention trends of different items exhibit mutual influence via mechanisms such as hyperlinks or recommendations. Many visualisation tools exist for time series, network evolution, or network influence; however, few systems connect all three. In this work, we present AttentionFlow, a new system to visualise networks of time series and the dynamic influence they have on one another. Centred around an ego node, our system simultaneously presents the time series on each node using two visual encodings: a tree ring for an overview and a line chart for details. AttentionFlow supports interactions such as overlaying time series of influence and filtering neighbours by time or flux. We demonstrate AttentionFlow using two real-world datasets, VevoMusic and WikiTraffic. We show that attention spikes in songs can be explained by external events such as major awards, or changes in the network such as the release of a new song. Separate case studies also demonstrate how an artist's influence changes over their career, and that correlated Wikipedia traffic is driven by cultural interests. More broadly, AttentionFlow can be generalised to visualise networks of time series on physical infrastructures such as road networks, or natural phenomena such as weather and geological measurements.

Submitted 3 February, 2021; originally announced February 2021.

Comments: Published in WSDM 2021. The demo is available at https://attentionflow.ml and code is available at https://github.com/alasdairtran/attentionflow

Journal ref: The Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM), 2021

arXiv:2010.02568 [pdf, other]

SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy

Authors: Umanga Bista, Alexander Patrick Mathews, Aditya Krishna Menon, Lexing Xie

Abstract: Most work on multi-document summarization has focused on generic summarization of information present in each individual document set. However, the under-explored setting of update summarization, where the goal is to identify the new information present in each set, is of equal practical interest (e.g., presenting readers with updates on an evolving news topic). In this work, we present SupMMD, a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing. SupMMD combines both supervised learning for salience and unsupervised learning for coverage and diversity. Further, we adapt multiple kernel learning to make use of similarity across multiple information sources (e.g., text features and knowledge based concepts). We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.

Submitted 6 October, 2020; originally announced October 2020.

Comments: 15 pages

Journal ref: EMNLP 2020

arXiv:2007.14082 [pdf, other]

UNIPoint: Universally Approximating Point Processes Intensities

Authors: Alexander Soen, Alexander Mathews, Daniel Grixti-Cheng, Lexing Xie

Abstract: Point processes are a useful mathematical tool for describing events over time, and so there are many recent approaches for representing and learning them. One notable open question is how to precisely describe the flexibility of point process models and whether there exists a general model that can represent all point processes. Our work bridges this gap. Focusing on the widely used event intensity function representation of point processes, we provide a proof that a class of learnable functions can universally approximate any valid intensity function. The proof connects the well known Stone-Weierstrass Theorem for function approximation, the uniform density of non-negative continuous functions using transfer functions, the formulation of the parameters of piece-wise continuous functions as a dynamic system, and a recurrent neural network implementation for capturing the dynamics. Using these insights, we design and implement UNIPoint, a novel neural point process model, using recurrent neural networks to parameterise sums of basis functions upon each event. Evaluations on synthetic and real world datasets show that this simpler representation performs better than Hawkes process variants and more complex neural network-based approaches. We expect this result will provide a practical basis for selecting and tuning models, as well as furthering theoretical work on representational complexity and learnability.

Submitted 2 March, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

arXiv:2004.08070 [pdf, other]

Transform and Tell: Entity-Aware News Image Captioning

Authors: Alasdair Tran, Alexander Mathews, Lexing Xie

Abstract: We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts. On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.

Submitted 12 June, 2020; v1 submitted 17 April, 2020; originally announced April 2020.

Comments: Published in CVPR 2020. Code is available at https://github.com/alasdairtran/transform-and-tell and demo is available at https://transform-and-tell.ml

ACM Class: I.4.0; I.2.7

Journal ref: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13035-13045

arXiv:1812.02171 [pdf, other]

doi 10.1609/aaai.v33i01.330120

Comparative Document Summarisation via Classification

Authors: Umanga Bista, Alexander Mathews, Minjeong Shin, Aditya Krishna Menon, Lexing Xie

Abstract: This paper considers extractive summarisation in a comparative setting: given two or more document groups (e.g., separated by publication time), the goal is to select a small number of documents that are representative of each group, and also maximally distinguishable from other groups. We formulate a set of new objective functions for this problem that connect recent literature on document summarisation, interpretable machine learning, and data subset selection. In particular, by casting the problem as a binary classification amongst different groups, we derive objectives based on the notion of maximum mean discrepancy, as well as a simple yet effective gradient-based optimisation strategy. Our new formulation allows scalable evaluations of comparative summarisation as a classification task, both automatically and via crowd-sourcing. To this end, we evaluate comparative summarisation methods on a newly curated collection of controversial news topics over 13 months. We observe that gradient-based optimisation outperforms discrete and baseline approaches in 14 out of 24 different automatic evaluation settings. In crowd-sourced evaluations, summaries from gradient optimisation elicit 7% more accurate classification from human workers than discrete optimisation. Our result contrasts with recent literature on submodular data subset selection that favours discrete optimisation. We posit that our formulation of comparative summarisation will prove useful in a diverse range of use cases such as comparing content sources, authors, related topics, or distinct view points.

Submitted 2 January, 2020; v1 submitted 5 December, 2018; originally announced December 2018.

Comments: Accepted for AAAI 2019

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019

arXiv:1805.07030 [pdf, other]

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text

Authors: Alexander Mathews, Lexing Xie, Xuming He

Abstract: Linguistic style is an essential part of written communication, with the power to affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. Existing approaches either require styled training captions aligned to images or generate captions with low relevance. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images. The core idea of this model, called SemStyle, is to separate semantics and style. One key component is a novel and concise semantic term representation generated using natural language processing techniques and frame semantics. In addition, we develop a unified language model that decodes sentences with diverse word choices and syntax for different styles. Evaluations, both automatic and manual, show captions from SemStyle preserve image semantics, are descriptive, and are style shifted. More broadly, this work provides possibilities to learn richer image descriptions from the plethora of linguistic data available on the web.

Submitted 17 May, 2018; originally announced May 2018.

Comments: Accepted at CVPR 2018

arXiv:1805.05557 [pdf, other]

Simplifying Sentences with Sequence to Sequence Models

Authors: Alexander Mathews, Lexing Xie, Xuming He

Abstract: We simplify sentences with an attentive neural network sequence to sequence model, dubbed S4. The model includes a novel word-copy mechanism and loss function to exploit linguistic similarities between the original and simplified sentences. It also jointly uses pre-trained and fine-tuned word embeddings to capture the semantics of complex sentences and to mitigate the effects of limited data. When trained and evaluated on pairs of sentences from thousands of news articles, we observe a 8.8 point improvement in BLEU score over a sequence to sequence baseline; however, learning word substitutions remains difficult. Such sequence to sequence models are promising for other text generation tasks such as style transfer.

Submitted 15 May, 2018; originally announced May 2018.

arXiv:1709.08448 [pdf, ps, other]

Extracting Ontological Knowledge from Textual Descriptions

Authors: Kevin Alex Mathews, P Sreenivasa Kumar

Abstract: Authoring of OWL-DL ontologies is intellectually challenging and to make this process simpler, many systems accept natural language text as input. A text-based ontology authoring approach can be successful only when it is combined with an effective method for extracting ontological axioms from text. Extracting axioms from unrestricted English input is a substantially challenging task due to the richness of the language. Controlled natural languages (CNLs) have been proposed in this context and these tend to be highly restrictive. In this paper, we propose a new CNL called TEDEI (TExtual DEscription Identifier) whose grammar is inspired by the different ways OWL-DL constructs are expressed in English. We built a system that transforms TEDEI sentences into corresponding OWL-DL axioms. Now, ambiguity due to different possible lexicalizations of sentences and semantic ambiguity present in sentences are challenges in this context. We find that the best way to handle these challenges is to construct axioms corresponding to alternative formalizations of the sentence so that the end-user can make an appropriate choice. The output is compared against human-authored axioms and in substantial number of cases, human-authored axiom is indeed one of the alternatives given by the system. The proposed system substantially enhances the types of sentence structures that can be used for ontology authoring.

Submitted 28 September, 2017; v1 submitted 25 September, 2017; originally announced September 2017.

Comments: 8 pages

arXiv:1510.01431 [pdf, other]

SentiCap: Generating Image Descriptions with Sentiments

Authors: Alexander Mathews, Lexing Xie, Xuming He

Abstract: The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such style is descriptions with emotions, which is commonplace in everyday communication, and influences decision-making and interpersonal relationships. We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments. We propose a novel switching recurrent neural network with word-level regularization, which is able to produce emotional image captions using only 2000+ training sentences containing sentiments. We evaluate the captions with different automatic and crowd-sourcing metrics. Our model compares favourably in common quality metrics for image captioning. In 84.6% of cases the generated positive captions were judged as being at least as descriptive as the factual captions. Of these positive captions 88% were confirmed by the crowd-sourced workers as having the appropriate sentiment.

Submitted 13 December, 2015; v1 submitted 6 October, 2015; originally announced October 2015.

ACM Class: I.2.10; I.2.7; I.2.6

Showing 1–20 of 20 results for author: Mathews, A