-
A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law
Authors:
Zhiyu Zoey Chen,
Jing Ma,
Xinlu Zhang,
Nan Hao,
An Yan,
Armineh Nourbakhsh,
Xianjun Yang,
Julian McAuley,
Linda Petzold,
William Yang Wang
Abstract:
In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics of LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be continually updated: https://github.com/czyssrs/LLM_X_papers.
Submitted 2 May, 2024;
originally announced May 2024.
-
BuDDIE: A Business Document Dataset for Multi-task Information Extraction
Authors:
Ran Zmigrod,
Dongsheng Wang,
Mathieu Sibue,
Yulong Pei,
Petr Babkin,
Ivan Brugere,
Xiaomo Liu,
Nacho Navarro,
Antony Papadimitriou,
William Watson,
Zhiqiang Ma,
Armineh Nourbakhsh,
Sameena Shah
Abstract:
The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, focusing on only a single type of document or task is not representative of how documents often need to be processed in the wild, where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.
Submitted 5 April, 2024;
originally announced April 2024.
-
Belief and Persuasion in Scientific Discourse on Social Media: A Study of the COVID-19 Pandemic
Authors:
Salwa Alamir,
Armineh Nourbakhsh,
Cecilia Tilli,
Sameena Shah,
Manuela Veloso
Abstract:
Research into COVID-19 has been rapidly evolving since the onset of the pandemic. This occasionally results in contradictory recommendations by credible sources of scientific opinion, public health authorities, and medical professionals. In this study, we examine whether this has resulted in a lack of trust in scientific opinion, by examining the belief patterns of social media users and their reactions to statements related to scientific facts. We devise models to mine belief and persuasion in Twitter discourse using semi-supervised approaches, and show the relationship between lack of belief and insurgence of paranoia and conspiracy theories. By investigating these belief patterns, we explore the best persuasion tactics for communicating information related to COVID-19.
Submitted 14 March, 2024;
originally announced March 2024.
-
TreeForm: End-to-end Annotation and Evaluation for Form Document Parsing
Authors:
Ran Zmigrod,
Zhiqiang Ma,
Armineh Nourbakhsh,
Sameena Shah
Abstract:
Visually Rich Form Understanding (VRFU) poses a complex research problem due to the documents' highly structured nature and yet highly variable style and content. Current annotation schemes decompose form understanding and omit key hierarchical structure, making development and evaluation of end-to-end models difficult. In this paper, we propose a novel F1 metric to evaluate form parsers and describe a new content-agnostic, tree-based annotation scheme for VRFU: TreeForm. We provide methods to convert previous annotation schemes into TreeForm structures and evaluate TreeForm predictions using a modified version of the normalized tree-edit distance. We present initial baselines for our end-to-end performance metric and the TreeForm edit distance, averaged over the FUNSD and XFUND datasets, of 61.5 and 26.4 respectively. We hope that TreeForm encourages deeper research in annotating, modeling, and evaluating the complexities of form-like documents.
Submitted 7 February, 2024;
originally announced February 2024.
-
DocGraphLM: Documental Graph Language Model for Information Extraction
Authors:
Dongsheng Wang,
Zhiqiang Ma,
Armineh Nourbakhsh,
Kang Gu,
Sameena Shah
Abstract:
Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two tropes of architectures have emerged -- transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and downweighs distant node detection. Our experiments on three SotA datasets show consistent improvement on IE and QA tasks with the adoption of graph features. Moreover, we report that adopting the graph features accelerates convergence in the learning process during training, despite being solely constructed through link prediction.
Submitted 5 January, 2024;
originally announced January 2024.
-
The Influence of Biomedical Research on Future Business Funding: Analyzing Scientific Impact and Content in Industrial Investments
Authors:
Reza Khanmohammadi,
Simerjot Kaur,
Charese H. Smiley,
Tuka Alhanai,
Ivan Brugere,
Armineh Nourbakhsh,
Mohammad M. Ghassemi
Abstract:
This paper investigates the relationship between scientific innovation in biomedical sciences and its impact on industrial activities, focusing on how the historical impact and content of scientific papers influenced future funding and innovation grant application content for small businesses. The research incorporates bibliometric analyses along with SBIR (Small Business Innovation Research) data to yield a holistic view of the science-industry interface. By evaluating the influence of scientific innovation on industry across 10,873 biomedical topics and taking into account their taxonomic relationships, we present an in-depth exploration of science-industry interactions where we quantify the temporal effects and impact latency of scientific advancements on industrial activities, spanning from 2010 to 2021. Our findings indicate that scientific progress substantially influenced industrial innovation funding and the direction of industrial innovation activities. Approximately 76% and 73% of topics showed a correlation and Granger-causality between scientific interest in papers and future funding allocations to relevant small businesses. Moreover, around 74% of topics demonstrated an association between the semantic content of scientific abstracts and future grant applications. Overall, the work contributes to a more nuanced and comprehensive understanding of the science-industry interface, opening avenues for more strategic resource allocation and policy developments aimed at fostering innovation.
Submitted 1 January, 2024;
originally announced January 2024.
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Authors:
Dongsheng Wang,
Natraj Raman,
Mathieu Sibue,
Zhiqiang Ma,
Petr Babkin,
Simerjot Kaur,
Yulong Pei,
Armineh Nourbakhsh,
Xiaomo Liu
Abstract:
Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records, often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers to a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
Submitted 31 December, 2023;
originally announced January 2024.
-
The Dark Side of the Language: Pre-trained Transformers in the DarkNet
Authors:
Leonardo Ranaldi,
Aria Nourbakhsh,
Arianna Patrizi,
Elena Sofia Ruzzetti,
Dario Onorati,
Francesca Fallucchi,
Fabio Massimo Zanzotto
Abstract:
Pre-trained Transformers are challenging human performances in many NLP tasks. The massive datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained Natural Language Understanding models perform on definitely unseen sentences provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks perform on par with pre-trained Transformers even after fine-tuning. Only after what we call extreme domain adaptation, that is, retraining with the masked language model task on all the novel corpus, pre-trained Transformers reach their standard high results. This suggests that huge pre-training corpora may give Transformers unexpected help since they are exposed to many of the possible sentences.
Submitted 17 November, 2023; v1 submitted 14 January, 2022;
originally announced January 2022.
-
Quantum Computing: Fundamentals, Trends and Perspectives for Chemical and Biochemical Engineers
Authors:
Amirhossein Nourbakhsh,
Mark Nicholas Jones,
Kaur Kristjuhan,
Deborah Carberry,
Jay Karon,
Christian Beenfeldt,
Kyarash Shahriari,
Martin P. Andersson,
Mojgan A. Jadidi,
Seyed Soheil Mansouri
Abstract:
We use the benefits and components of classical computers every day. However, for many types of problems, computational complexity grows with problem size beyond what classical computers will ever be able to solve. Quantum computing (QC) is a computation model that uses quantum physical properties to solve such problems. QC is at the early stage of large-scale adoption in various industry domains to take advantage of the algorithmic speed-ups it has to offer. It can be applied in a variety of areas, such as computer science, mathematics, chemical and biochemical engineering, and the financial industry. The main goal of this paper is to give an overview to chemical and biochemical researchers and engineers who may not be familiar with quantum computation. Thus, the paper begins by explaining the fundamental concepts of QC. The second contribution addresses the fact that the chemical engineering literature still lacks a comprehensive review of the recent advances in QC. Therefore, this article reviews and summarizes the state of the art to gain insight into how quantum computation can benefit and optimize chemical engineering problems. A bibliographic analysis covers the literature in QC and analyzes quantum computing research in chemical engineering on various publication topics, using Clarivate Analytics and covering the years 1990 to 2020. After the bibliographic analysis, relevant applications of QC in chemical and biochemical engineering are highlighted, and a conclusion offers an outlook on future directions within the field.
Submitted 8 January, 2022;
originally announced January 2022.
-
Parameterized Explanations for Investor / Company Matching
Authors:
Simerjot Kaur,
Ivan Brugere,
Andrea Stefanucci,
Armineh Nourbakhsh,
Sameena Shah,
Manuela Veloso
Abstract:
Matching companies and investors is usually considered a highly specialized decision making process. Building an AI agent that can automate such recommendation process can significantly help reduce costs, and eliminate human biases and errors. However, limited sample size of financial data-sets and the need for not only good recommendations, but also explaining why a particular recommendation is being made, makes this a challenging problem. In this work we propose a representation learning based recommendation engine that works extremely well with small datasets and demonstrate how it can be coupled with a parameterized explanation generation engine to build an explainable recommendation system for investor-company matching. We compare the performance of our system with human generated recommendations and demonstrate the ability of our algorithm to perform extremely well on this task. We also highlight how explainability helps with real-life adoption of our system.
Submitted 27 October, 2021;
originally announced November 2021.
-
A Framework for Institutional Risk Identification using Knowledge Graphs and Automated News Profiling
Authors:
Mahmoud Mahfouz,
Armineh Nourbakhsh,
Sameena Shah
Abstract:
Organizations around the world face an array of risks impacting their operations globally. It is imperative to have a robust risk identification process to detect and evaluate the impact of potential risks before they materialize. Given the nature of the task and the current requirements of deep subject matter expertise, most organizations utilize a heavily manual process. In our work, we develop an automated system that (a) continuously monitors global news, (b) is able to autonomously identify and characterize risks, (c) is able to determine the proximity of reaching triggers to determine the distance from the manifestation of the risk impact and (d) identifies organization's operational areas that may be most impacted by the risk. Other contributions also include: (a) a knowledge graph representation of risks and (b) relevant news matching to risks identified by the organization utilizing a neural embedding model to match the textual description of a given risk with multi-lingual news.
Submitted 19 September, 2021;
originally announced September 2021.
-
Robust Document Representations using Latent Topics and Metadata
Authors:
Natraj Raman,
Armineh Nourbakhsh,
Sameena Shah,
Manuela Veloso
Abstract:
Task specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples are not available at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata artifacts in a task agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft-partition of the input space when generating text embeddings. Specifically, we employ a pre-learned topic model distribution as surrogate labels and construct a loss function based on KL divergence. Our solution also incorporates metadata explicitly rather than just augmenting them with text. The generated document embeddings exhibit compositional characteristics and are directly used by downstream classification tasks to create decision boundaries from a small number of labeled examples, thereby eschewing complicated recognition methods. We demonstrate through extensive evaluation that our proposed cross-model fusion solution outperforms several competitive baselines on multiple datasets.
Submitted 23 October, 2020;
originally announced October 2020.
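The abstract of "Robust Document Representations using Latent Topics and Metadata" above describes using a pre-learned topic distribution as surrogate labels with a KL-divergence loss. The following is only a minimal sketch of that idea for a single document, not the authors' implementation: the function names, the three-topic target, and the encoder scores are all hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


# Hypothetical topic-model posterior for one document (the surrogate label).
topic_target = [0.7, 0.2, 0.1]

# Hypothetical raw scores from an encoder head for the same document.
encoder_logits = [2.0, 0.5, -1.0]

predicted = softmax(encoder_logits)
loss = kl_divergence(topic_target, predicted)
print(f"KL loss: {loss:.4f}")
```

The loss is zero only when the predicted soft partition matches the topic distribution exactly, so minimizing it pulls the encoder output toward the topic model's assignment without requiring labeled examples.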
-
DocuBot : Generating financial reports using natural language interactions
Authors:
Vineeth Ravi,
Selim Amrouni,
Andrea Stefanucci,
Armineh Nourbakhsh,
Prashant Reddy,
Manuela Veloso
Abstract:
The financial services industry perpetually processes an overwhelming amount of complex data. Digital reports are often created based on tedious manual analysis as well as visualization of the underlying trends and characteristics of data. Often, the accruing costs of human computation errors in creating these reports are very high. We present DocuBot, a novel AI-powered virtual assistant for creating and modifying content in digital documents by modeling natural language interactions as "skills" and using them to transform underlying data. DocuBot has the ability to agglomerate saved skills for reuse, enabling humans to automatically generate recurrent reports. DocuBot also has the capability to continuously learn domain-specific and user-specific vocabulary by interacting with the user. We present evidence that DocuBot adds value to the financial industry and demonstrate its impact with experiments involving real and simulated users tasked with creating PowerPoint presentations.
Submitted 1 February, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Impact of $Al_2O_3$ Passivation on the Photovoltaic Performance of Vertical $WSe_2$ Schottky Junction Solar Cells
Authors:
Elaine McVay,
Ahmad Zubair,
Yuxuan Lin,
Amirhasan Nourbakhsh,
Tomás Palacios
Abstract:
Transition metal dichalcogenide (TMD) materials have emerged as promising candidates for thin film solar cells due to their wide bandgap range across the visible wavelengths, high absorption coefficient, and ease of integration with both arbitrary substrates and conventional semiconductor technologies. However, reported TMD-based solar cells suffer from relatively low external quantum efficiencies (EQE) and low open circuit voltage due to unoptimized design and device fabrication. This paper studies $Pt/WSe_2$ vertical Schottky junction solar cells with various $WSe_2$ thicknesses in order to find the optimum absorber thickness. Also, we show that the photovoltaic performance can be improved via $Al_2O_3$ passivation, which increases the EQE by up to 29.5% at 410 nm wavelength incident light. The overall resulting short circuit current improves through antireflection coating, surface doping, and surface trap passivation effects. Thanks to the ${Al_2O_3}$ coating, this work demonstrates a device with open circuit voltage ($V_{OC}$) of 380 mV and short circuit current density ($J_{SC}$) of 10.7 $mA/cm^2$. Finally, the impact of Schottky barrier height inhomogeneity at the $Pt/WSe_2$ contact is investigated as a source of open circuit voltage lowering in these devices.
Submitted 30 June, 2020;
originally announced June 2020.
-
SPot: A tool for identifying operating segments in financial tables
Authors:
Zhiqiang Ma,
Steven Pomerville,
Mingyang Di,
Armineh Nourbakhsh
Abstract:
In this paper we present SPot, an automated tool for detecting operating segments and their related performance indicators from earnings reports. Due to their company-specific nature, operating segments cannot be detected using taxonomy-based approaches. Instead, we train a Bidirectional RNN classifier that can distinguish between common metrics such as "revenue" and company-specific metrics that are likely to be operating segments, such as "iPhone" or "cloud services". SPot surfaces the results in an interactive web interface that allows users to trace and adjust performance metrics for each operating segment. This facilitates credit monitoring, enables them to perform competitive benchmarking more effectively, and can be used for trend analysis at company and sector levels.
Submitted 17 May, 2020;
originally announced May 2020.
-
Toward Dialogue Modeling: A Semantic Annotation Scheme for Questions and Answers
Authors:
Maria-Andrea Cruz-Blandón,
Gosse Minnema,
Aria Nourbakhsh,
Maria Boritchev,
Maxime Amblard
Abstract:
The present study proposes an annotation scheme for classifying the content and discourse contribution of question-answer pairs. We propose detailed guidelines for using the scheme and apply them to dialogues in English, Spanish, and Dutch. Finally, we report on initial machine learning experiments for automatic annotation.
Submitted 23 August, 2019;
originally announced August 2019.
-
A framework for anomaly detection using language modeling, and its applications to finance
Authors:
Armineh Nourbakhsh,
Grace Bang
Abstract:
In the finance sector, studies focused on anomaly detection are often associated with time-series and transactional data analytics. In this paper, we lay out the opportunities for applying anomaly and deviation detection methods to text corpora and challenges associated with them. We argue that language models that use distributional semantics can play a significant role in advancing these studies in novel directions, with new applications in risk identification, predictive modeling, and trend analysis.
Submitted 24 August, 2019;
originally announced August 2019.
-
Reuters Tracer: Toward Automated News Production Using Large Scale Social Media Data
Authors:
Xiaomo Liu,
Armineh Nourbakhsh,
Quanzhi Li,
Sameena Shah,
Robert Martin,
John Duprey
Abstract:
To deal with the sheer volume of information and gain competitive advantage, the news industry has started to explore and invest in news automation. In this paper, we present Reuters Tracer, a system that automates end-to-end news production using Twitter data. It is capable of detecting, classifying, annotating, and disseminating news in real time for Reuters journalists without manual intervention. In contrast to other similar systems, Tracer is topic and domain agnostic. It has a bottom-up approach to news detection, and does not rely on a predefined set of sources or subjects. Instead, it identifies emerging conversations from 12+ million tweets per day and selects those that are news-like. Then, it contextualizes each story by adding a summary and a topic to it, estimating its newsworthiness, veracity, novelty, and scope, and geotags it. Designing algorithms to generate news that meets the standards of Reuters journalists in accuracy and timeliness is quite challenging. But Tracer is able to achieve competitive precision, recall, timeliness, and veracity on news detection and delivery. In this paper, we reveal our key algorithm designs and evaluations that helped us achieve this goal, and lessons learned along the way.
Submitted 10 November, 2017;
originally announced November 2017.
-
"Breaking" Disasters: Predicting and Characterizing the Global News Value of Natural and Man-made Disasters
Authors:
Armineh Nourbakhsh,
Quanzhi Li,
Xiaomo Liu,
Sameena Shah
Abstract:
Due to their often unexpected nature, natural and man-made disasters are difficult to monitor and detect for journalists and disaster management response teams. Journalists are increasingly relying on signals from social media to detect such stories in their early stage of development. Twitter, which features a vast network of local news outlets, is a major source of early signal for disaster detection. Journalists who work for global desks often follow these sources via Twitter's lists, but have to comb through thousands of small-scale or low-impact stories to find events that may be globally relevant. These are events that have a large scope, high impact, or potential geo-political relevance. We propose a model for automatically identifying events from local news sources that may break on a global scale within the next 24 hours. The results are promising and can be used in a predictive setting to help journalists manage their sources more effectively, or in a descriptive manner to analyze media coverage of disasters. Through the feature evaluation process, we also address the question: "what makes a disaster event newsworthy on a global scale?" As part of our data collection process, we have created a list of local sources of disaster/accident news on Twitter, which we have made publicly available.
Submitted 7 September, 2017;
originally announced September 2017.
-
Data Sets: Word Embeddings Learned from Tweets and General Data
Authors:
Quanzhi Li,
Sameena Shah,
Xiaomo Liu,
Armineh Nourbakhsh
Abstract:
A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks.
Submitted 13 August, 2017;
originally announced August 2017.
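The abstract of "Data Sets: Word Embeddings Learned from Tweets and General Data" above defines a word embedding as a dense, real-valued vector per word. As a rough illustration of how such vectors are typically compared, the sketch below ranks words by cosine similarity. The three-dimensional vectors and the vocabulary are made-up toy values, not taken from the released data sets, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, table):
    """Return the other word in `table` whose vector is most similar to `word`'s."""
    query = table[word]
    candidates = (w for w in table if w != word)
    return max(candidates, key=lambda w: cosine_similarity(query, table[w]))


# Toy embedding table; real embeddings usually have hundreds of dimensions.
toy_embeddings = {
    "lol": [0.9, 0.1, 0.0],
    "haha": [0.8, 0.2, 0.1],
    "earnings": [0.0, 0.1, 0.9],
}

print(nearest("lol", toy_embeddings))
```

With real embedding files, the same lookup would run over the loaded vocabulary instead of a hand-written dictionary.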