Search | arXiv e-print repository

SEAM: A Stochastic Benchmark for Multi-Document Tasks

Authors: Gili Lior, Avi Caciularu, Arie Cattan, Shahar Levy, Ori Shapira, Gabriel Stanovsky

Abstract: Various tasks, such as summarization, multi-hop question answering, or coreference resolution, are naturally phrased over collections of real-world documents. Such tasks present a unique set of challenges, revolving around the lack of coherent narrative structure across documents, which often leads to contradiction, omission, or repetition of information. Despite their real-world application and challenging properties, there is currently no benchmark which specifically measures the abilities of large language models (LLMs) on multi-document tasks. To bridge this gap, we present SEAM (a Stochastic Evaluation Approach for Multi-document tasks), a conglomerate benchmark over a diverse set of multi-document datasets, setting conventional evaluation criteria, input-output formats, and evaluation protocols. In particular, SEAM addresses the sensitivity of LLMs to minor prompt variations through repeated evaluations, where in each evaluation we sample uniformly at random the values of arbitrary factors (e.g., the order of documents). We evaluate different LLMs on SEAM finding that multi-document tasks pose a significant challenge for LLMs, even for state-of-the-art models with 70B parameters. In addition, we show that the stochastic approach uncovers underlying statistical trends which cannot be observed in a static benchmark. We hope that SEAM will spur progress via consistent and meaningful evaluation of multi-document tasks.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2403.11092 [pdf, other]

Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

Authors: Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.

Submitted 17 March, 2024; originally announced March 2024.

Comments: NAACL 2024 Main Conference

arXiv:2403.04858 [pdf, other]

Evaluating Biases in Context-Dependent Health Questions

Authors: Sharon Levy, Tahilin Sanchez Karver, William D. Adler, Michelle R. Kaufman, Mark Dredze

Abstract: Chat-based large language models have the opportunity to empower individuals lacking high-quality healthcare access to receive personalized information across a variety of topics. However, users may ask underspecified questions that require additional context for a model to correctly answer. We study how large language model biases are exhibited through these contextual questions in the healthcare domain. To accomplish this, we curate a dataset of sexual and reproductive healthcare questions that are dependent on age, sex, and location attributes. We compare models' outputs with and without demographic context to determine group alignment among our contextual questions. Our experiments reveal biases in each of these attributes, where young adult female users are favored.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2312.01217 [pdf, other]

Understanding Opinions Towards Climate Change on Social Media

Authors: Yashaswi Pupneja, Joseph Zou, Sacha Lévy, Shenyang Huang

Abstract: Social media platforms such as Twitter (now known as X) have revolutionized how the public engage with important societal and political topics. Recently, climate change discussions on social media became a catalyst for political polarization and the spreading of misinformation. In this work, we aim to understand how real world events influence the opinions of individuals towards climate change related topics on social media. To this end, we extracted and analyzed a dataset of 13.6 million tweets sent by 3.6 million users from 2006 to 2019. Then, we construct a temporal graph from the user-user mentions network and utilize the Louvain community detection algorithm to analyze the changes in community structure around Conference of the Parties on Climate Change (COP) events. Next, we also apply tools from the Natural Language Processing literature to perform sentiment analysis and topic modeling on the tweets. Our work acts as a first step towards understanding the evolution of pro-climate change communities around COP events. Answering these questions helps us understand how to raise people's awareness towards climate change thus hopefully calling on more individuals to join the collaborative effort in slowing down climate change.

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2310.09624 [pdf, other]

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

Authors: Alex Mei, Sharon Levy, William Yang Wang

Abstract: As large language models are integrated into society, robustness toward a suite of prompts is increasingly important to maintain reliability in a high-variance environment. Robustness evaluations must comprehensively encapsulate the various settings in which a user may invoke an intelligent system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming, consisting of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. For robust safety evaluation, we apply these methods in the critical domain of AI safety to algorithmically generate a test suite of prompts covering diverse robustness settings -- semantic equivalence, related scenarios, and adversarial. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. Despite dedicated safeguards in existing state-of-the-art models, we find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings, raising concerns for users' physical safety.

Submitted 11 November, 2023; v1 submitted 14 October, 2023; originally announced October 2023.

Comments: In Findings of the 2023 Conference on Empirical Methods in Natural Language Processing

arXiv:2310.01618 [pdf, other]

Operator Learning Meets Numerical Analysis: Improving Neural Networks through Iterative Methods

Authors: Emanuele Zappala, Daniel Levine, Sizhuang He, Syed Rizvi, Sacha Levy, David van Dijk

Abstract: Deep neural networks, despite their success in numerous applications, often function without established theoretical foundations. In this paper, we bridge this gap by drawing parallels between deep learning and classical numerical analysis. By framing neural networks as operators with fixed points representing desired solutions, we develop a theoretical framework grounded in iterative methods for operator equations. Under defined conditions, we present convergence proofs based on fixed point theory. We demonstrate that popular architectures, such as diffusion models and AlphaFold, inherently employ iterative operator learning. Empirical assessments highlight that performing iterations through network operators improves performance. We also introduce an iterative graph neural network, PIGN, that further demonstrates benefits of iterations. Our work aims to enhance the understanding of deep learning by merging insights from numerical analysis, potentially guiding the design of future networks with clearer theoretical underpinnings and improved performance.

Submitted 2 October, 2023; originally announced October 2023.

Comments: 27 pages (13+14). 8 Figures and 5 tables. Comments are welcome!

arXiv:2308.13699 [pdf, other]

Party Prediction for Twitter

Authors: Kellin Pelrine, Anne Imouza, Zachary Yang, Jacob-Junqi Tian, Sacha Lévy, Gabrielle Desrosiers-Brisebois, Aarash Feizi, Cécile Amadoro, André Blais, Jean-François Godbout, Reihaneh Rabbany

Abstract: A large number of studies on social media compare the behaviour of users from different political parties. As a basic step, they employ a predictive model for inferring their political affiliation. The accuracy of this model can change the conclusions of a downstream analysis significantly, yet the choice between different models seems to be made arbitrarily. In this paper, we provide a comprehensive survey and an empirical comparison of the current party prediction practices and propose several new approaches which are competitive with or outperform state-of-the-art methods, yet require less computational resources. Party prediction models rely on the content generated by the users (e.g., tweet texts), the relations they have (e.g., who they follow), or their activities and interactions (e.g., which tweets they like). We examine all of these and compare their signal strength for the party prediction task. This paper lets the practitioner select from a wide range of data types that all give strong performance. Finally, we conduct extensive experiments on different aspects of these methods, such as data collection speed and transfer capabilities, which can provide further insights for both applied and methodological research.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2307.04228 [pdf, other]

Bayesian tomography using polynomial chaos expansion and deep generative networks

Authors: Giovanni Angelo Meles, Macarena Amaya, Shiran Levy, Stefano Marelli, Niklas Linde

Abstract: Implementations of Markov chain Monte Carlo (MCMC) methods need to confront two fundamental challenges: accurate representation of prior information and efficient evaluation of likelihoods. Principal component analysis (PCA) and related techniques can in some cases facilitate the definition and sampling of the prior distribution, as well as the training of accurate surrogate models, using for instance, polynomial chaos expansion (PCE). However, complex geological priors with sharp contrasts necessitate more complex dimensionality-reduction techniques, such as, deep generative models (DGMs). By sampling a low-dimensional prior probability distribution defined in the low-dimensional latent space of such a model, it becomes possible to efficiently sample the physical domain at the price of a generator that is typically highly non-linear. Training a surrogate that is capable of capturing intricate non-linear relationships between latent parameters and outputs of forward modeling presents a notable challenge. Indeed, while PCE models provide high accuracy when the input-output relationship can be effectively approximated by relatively low-degree multivariate polynomials, this condition is typically not met when employing latent variables derived from DGMs. In this contribution, we present a strategy combining the excellent reconstruction performances of a variational autoencoder (VAE) with the accuracy of PCA-PCE surrogate modeling in the context of Bayesian ground penetrating radar (GPR) traveltime tomography. Within the MCMC process, the parametrization of the VAE is leveraged for prior exploration and sample proposals. Concurrently, surrogate modeling is conducted using PCE, which operates on either globally or locally defined principal components of the VAE samples under examination.

Submitted 19 October, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

Comments: 25 pages, 15 figures

arXiv:2305.11242 [pdf, other]

Comparing Biases and the Impact of Multilingual Training across Multiple Languages

Authors: Sharon Levy, Neha Anna John, Ling Liu, Yogarshi Vyas, Jie Ma, Yoshinari Fujinuma, Miguel Ballesteros, Vittorio Castelli, Dan Roth

Abstract: Studies in bias and fairness in natural language processing have primarily examined social biases within a single language and/or across few attributes (e.g. gender, race). However, biases can manifest differently across various languages for individual attributes. As a result, it is critical to examine biases within each language and attribute. Of equal importance is to study how these biases compare across languages and how the biases are affected when training a model on multilingual data versus monolingual data. We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task to observe whether specific demographics are viewed more positively. We study bias similarities and differences across these languages and investigate the impact of multilingual vs. monolingual training data. We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender. Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture (e.g. majority religions and nationalities). Additionally, we find an increased variation in predictions across protected groups, indicating bias amplification, after multilingual finetuning in comparison to multilingual pretraining.

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2304.11122 [pdf, other]

Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

Authors: W. Pepper Marts, Matthew G. F. Dosanjh, Whit Schonbein, Scott Levy, Patrick G. Bridges

Abstract: Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to improve application run-time. Communication is divided among the compute threads such that each individual thread can initiate transmission of its portion of the data as soon as it is complete rather than waiting for all of the threads. However, the benefit of early-bird communication depends on the completion timing of the individual threads. In this paper, we measure and evaluate the potential overlap, the idle time each thread experiences between finishing their computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads across all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear to be well-suited for early-bird communication because of their wider thread distribution and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication.

Submitted 21 April, 2023; originally announced April 2023.

Report number: SAND2023-02469O

arXiv:2212.09667 [pdf, other]

Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI

Authors: Alex Mei, Sharon Levy, William Yang Wang

Abstract: Users' physical safety is an increasing concern as the market for intelligent systems continues to grow, where unconstrained systems may recommend users dangerous actions that can lead to serious injury. Covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and are challenging to detect as harmful. We propose FARM, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety. In particular, FARM foveates on missing knowledge to qualify the information required to reason in specific scenarios and retrieves this information with attribution to trustworthy sources. This knowledge is used to both classify the safety of the original text and generate human-interpretable rationales, shedding light on the risk of systems to specific user groups and helping both stakeholders manage the risks of their systems and policymakers to provide concrete safeguards for consumer safety. Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, showing absolute improvement in safety classification accuracy by 5.9%.

Submitted 19 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: In Findings of the 2023 Conference of the Association for Computational Linguistics

arXiv:2210.12152 [pdf, other]

WikiWhy: Answering and Explaining Cause-and-Effect Questions

Authors: Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy, Yujie Lu, William Yang Wang

Abstract: As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.

Submitted 30 November, 2022; v1 submitted 21 October, 2022; originally announced October 2022.

arXiv:2210.10045 [pdf, other]

SafeText: A Benchmark for Exploring Physical Safety in Language Models

Authors: Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia Chilton, Desmond Patton, Kathleen McKeown, William Yang Wang

Abstract: Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.

Submitted 18 October, 2022; originally announced October 2022.

Comments: Accepted to EMNLP 2022

arXiv:2210.09306 [pdf, other]

Mitigating Covertly Unsafe Text within Natural Language Systems

Authors: Alex Mei, Anisha Kabir, Sharon Levy, Melanie Subbiah, Emily Allaway, John Judge, Desmond Patton, Bruce Bimber, Kathleen McKeown, William Yang Wang

Abstract: An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system's information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

Submitted 20 March, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: In Findings of the 2022 Conference on Empirical Methods in Natural Language Processing

arXiv:2209.11135 [pdf, other]

Active Keyword Selection to Track Evolving Topics on Twitter

Authors: Sacha Lévy, Farimah Poursafaei, Kellin Pelrine, Reihaneh Rabbany

Abstract: How can we study social interactions on evolving topics at a mass scale? Over the past decade, researchers from diverse fields such as economics, political science, and public health have often done this by querying Twitter's public API endpoints with hand-picked topical keywords to search or stream discussions. However, despite the API's accessibility, it remains difficult to select and update keywords to collect high-quality data relevant to topics of interest. In this paper, we propose an active learning method for rapidly refining query keywords to increase both the yielded topic relevance and dataset size. We leverage a large open-source COVID-19 Twitter dataset to illustrate the applicability of our method in tracking Tweets around the key sub-topics of Vaccine, Mask, and Lockdown. Our experiments show that our method achieves an average topic-related keyword recall 2x higher than baselines. We open-source our code along with a web interface for keyword selection to make data collection from Twitter more systematic for researchers.

Submitted 22 September, 2022; originally announced September 2022.

Comments: 10 pages, 3 figures

arXiv:2205.09830 [pdf, ps, other]

Towards Understanding Gender-Seniority Compound Bias in Natural Language Generation

Authors: Samhita Honnavalli, Aesha Parekh, Lily Ou, Sophie Groenwold, Sharon Levy, Vicente Ordonez, William Yang Wang

Abstract: Women are often perceived as junior to their male counterparts, even within the same job titles. While there has been significant progress in the evaluation of gender bias in natural language processing (NLP), existing studies seldom investigate how biases toward gender groups change when compounded with other societal biases. In this work, we investigate how seniority impacts the degree of gender bias exhibited in pretrained neural generation models by introducing a novel framework for probing compound bias. We contribute a benchmark robustness-testing dataset spanning two domains, U.S. senatorship and professorship, created using a distant-supervision method. Our dataset includes human-written text with underlying ground truth and paired counterfactuals. We then examine GPT-2 perplexity and the frequency of gendered language in generated text. Our results show that GPT-2 amplifies bias by considering women as junior and men as senior more often than the ground truth in both domains. These results suggest that NLP applications built using GPT-2 may harm women in professional capacities.

Submitted 19 May, 2022; originally announced May 2022.

Comments: 6 pages, LREC 2022

arXiv:2204.13243 [pdf, other]

HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data

Authors: Kai Nakamura, Sharon Levy, Yi-Lin Tuan, Wenhu Chen, William Yang Wang

Abstract: A pressing challenge in current dialogue systems is to successfully converse with users on topics with information distributed across different modalities. Previous work in multiturn dialogue systems has primarily focused on either text or table information. In more realistic scenarios, having a joint understanding of both is critical as knowledge is typically distributed over both unstructured and structured forms. We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables. The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions. We propose retrieval, system state tracking, and dialogue response generation tasks for our dataset and conduct baseline experiments for each. Our results show that there is still ample opportunity for improvement, demonstrating the importance of building stronger dialogue systems that can reason over the complex setting of information-seeking dialogue grounded on tables and text.

Submitted 27 April, 2022; originally announced April 2022.

Comments: Findings of ACL 2022

arXiv:2204.05959 [pdf]

"Smarter" NICs for faster molecular dynamics: a case study

Authors: Sara Karamati, Clayton Hughes, K. Scott Hemmert, Ryan E. Grant, W. Whit Schonbein, Scott Levy, Thomas M. Conte, Jeffrey Young, Richard W. Vuduc

Abstract: This work evaluates the benefits of using a "smart" network interface card (SmartNIC) as a compute accelerator for the example of the MiniMD molecular dynamics proxy application. The accelerator is NVIDIA's BlueField-2 card, which includes an 8-core Arm processor along with a small amount of DRAM and storage. We test the networking and data movement performance of these cards compared to a standard Intel server host using microbenchmarks and MiniMD. In MiniMD, we identify two distinct classes of computation, namely core computation and maintenance computation, which are executed in sequence. We restructure the algorithm and code to weaken this dependence and increase task parallelism, thereby making it possible to increase utilization of the BlueField-2 concurrently with the host. We evaluate our implementation on a cluster consisting of 16 dual-socket Intel Broadwell host nodes with one BlueField-2 per host-node. Our results show that while the overall compute performance of BlueField-2 is limited, using them with a modified MiniMD algorithm allows for up to 20% speedup over the host CPU baseline with no loss in simulation accuracy.

Submitted 12 April, 2022; originally announced April 2022.

arXiv:2201.11153 [pdf, other]

Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains

Authors: Alon Albalak, Sharon Levy, William Yang Wang

Abstract: Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, leading to few or no such systems. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method utilizing automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting. We illustrate the capabilities of our system with examples and release all code necessary to train and deploy such a system.

Submitted 26 January, 2022; originally announced January 2022.

Comments: 6 pages, 8 figures

arXiv:2110.13819 [pdf, other]

CloudFindr: A Deep Learning Cloud Artifact Masker for Satellite DEM Data

Authors: Kalina Borkiewicz, Viraj Shah, J. P. Naiman, Chuanyue Shen, Stuart Levy, Jeff Carpenter

Abstract: Artifact removal is an integral component of cinematic scientific visualization, and is especially challenging with big datasets in which artifacts are difficult to define. In this paper, we describe a method for creating cloud artifact masks which can be used to remove artifacts from satellite imagery using a combination of traditional image processing together with deep learning based on U-Net. Compared to previous methods, our approach does not require multi-channel spectral imagery but performs successfully on single-channel Digital Elevation Models (DEMs). DEMs are a representation of the topography of the Earth and have a variety of applications including planetary science, geology, flood modeling, and city planning.

Submitted 26 October, 2021; originally announced October 2021.

arXiv:2110.06962 [pdf, other]

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

Authors: Sharon Levy, Kevin Mo, Wenhan Xiong, William Yang Wang

Abstract: Since late 2019, COVID-19 has quickly emerged as the newest biomedical domain, resulting in a surge of new information. As with other emergent domains, the discussion surrounding the topic has been rapidly changing, leading to the spread of misinformation. This has created the need for a public space for users to ask questions and receive credible, scientific answers. To fulfill this need, we turn to the task of open-domain question-answering, which we can use to efficiently find answers to free-text questions from a large set of documents. In this work, we present such a system for the emergent domain of COVID-19. Despite the small data size available, we are able to successfully train the system to retrieve answers from a large-scale corpus of published COVID-19 scientific papers. Furthermore, we incorporate effective re-ranking and question-answering techniques, such as document diversity and multiple answer spans. Our open-domain question-answering system can further act as a model for the quick development of similar systems that can be adapted and modified for other developing emergent domains.

Submitted 13 October, 2021; originally announced October 2021.

Comments: EMNLP 2021 Demo

arXiv:2109.08490 [pdf, other]

Integrating Deep Reinforcement and Supervised Learning to Expedite Indoor Mapping

Authors: Elchanan Zwecher, Eran Iceland, Sean R. Levy, Shmuel Y. Hayoun, Oren Gal, Ariel Barel

Abstract: The challenge of mapping indoor environments is addressed. Typical heuristic algorithms for solving the motion planning problem are frontier-based methods, which are especially effective when the environment is completely unknown. However, in cases where prior statistical data on the environment's architectonic features is available, such algorithms can be far from optimal. Furthermore, their calculation time may increase substantially as more areas are exposed. In this paper we propose two means by which to overcome these shortcomings. One is the use of deep reinforcement learning to train the motion planner. The second is the inclusion of a pre-trained generative deep neural network, acting as a map predictor. Each one helps to improve the decision making through use of the learned structural statistics of the environment, and both, being realized as neural networks, ensure a constant calculation time. We show that combining the two methods can shorten the duration of the mapping process by up to 4 times, compared to frontier-based motion planning.

Submitted 27 February, 2022; v1 submitted 17 September, 2021; originally announced September 2021.

Comments: Accepted to ICRA-22 conference (23-27 May, 2022)

arXiv:2109.03858 [pdf, other]

Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation

Authors: Shahar Levy, Koren Lazar, Gabriel Stanovsky

Abstract: Recent works have found evidence of gender bias in models of machine translation and coreference resolution using mostly synthetic diagnostic datasets. While these quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding it mitigates bias on a held out set. Our dataset and models are publicly available at www.github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation mitigation techniques in realistic settings.

Submitted 10 September, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

Comments: Accepted to Findings of EMNLP 2021

arXiv:2101.00433 [pdf, other]

doi 10.18653/v1/2021.emnlp-main.153

Modeling Disclosive Transparency in NLP Application Descriptions

Authors: Michael Saxon, Sharon Levy, Xinyi Wang, Alon Albalak, William Yang Wang

Abstract: Broader disclosive transparency (truth and clarity in communication regarding the function of AI systems) is widely considered desirable. Unfortunately, it is a nebulous concept, difficult to both define and quantify. This is problematic, as previous work has demonstrated possible trade-offs and negative consequences to disclosive transparency, such as a confusion effect, where "too much information" clouds a reader's understanding of what a system description means. Disclosive transparency's subjective nature has rendered deep study into these problems and their remedies difficult. To improve this state of affairs, we introduce neural language model-based probabilistic metrics to directly model disclosive transparency, and demonstrate that they correlate with user and expert opinions of system transparency, making them a valid objective proxy. Finally, we demonstrate the use of these metrics in a pilot study quantifying the relationships between transparency, confusion, and user perceptions in a corpus of real NLP system descriptions.

Submitted 10 September, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

Comments: To appear at EMNLP 2021. 15 pages, 10 figures, 7 tables

Journal ref: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp 2023-2037

arXiv:2101.00379 [pdf, other]

Investigating Memorization of Conspiracy Theories in Text Generation

Authors: Sharon Levy, Michael Saxon, William Yang Wang

Abstract: The adoption of natural language generation (NLG) models can leave individuals vulnerable to the generation of harmful information memorized by the models, such as conspiracy theories. While previous studies examine conspiracy theories in the context of social media, they have not evaluated their presence in the new space of generative language models. In this work, we investigate the capability of language models to generate conspiracy theory text. Specifically, we aim to answer: can we test pretrained generative language models for the memorization and elicitation of conspiracy theories without access to the model's training data? We highlight the difficulties of this task and discuss it in the context of memorization, generalization, and hallucination. Utilizing a new dataset consisting of conspiracy theory topics and machine-generated conspiracy theories helps us discover that many conspiracy theories are deeply rooted in the pretrained language models. Our experiments demonstrate a relationship between model parameters such as size and temperature and their propensity to generate conspiracy theory text. These results indicate the need for a more thorough review of NLG applications before release and an in-depth discussion of the drawbacks of memorization in generative language models.

Submitted 8 June, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

Comments: ACL 2021 Findings

arXiv:2011.02043 [pdf, other]

Deep-Learning-Aided Path Planning and Map Construction for Expediting Indoor Mapping

Authors: Elchanan Zwecher, Eran Iceland, Shmuel Y. Hayoun, Ahavatya Revivo, Sean R. Levy, Ariel Barel

Abstract: The problem of autonomous indoor mapping is addressed. The goal is to minimize the time to achieve a predefined percentage of exposure with some desired level of certainty. The use of a pre-trained generative deep neural network, acting as a map predictor, in both the path planning and the map construction is proposed in order to expedite the mapping process. This method is examined in combination with several frontier-based path planners for two distinct floorplan datasets. Simulations are run for several configurations of the integrated map predictor, the results of which reveal that by utilizing the prediction a significant reduction in mapping time is possible. When the prediction is integrated in both path planning and map construction processes it is shown that the mapping time may in some cases be cut by over 50%.

Submitted 13 August, 2022; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: Submitted to Robotics and Autonomous Systems journal

arXiv:2010.02510 [pdf, other]

Investigating African-American Vernacular English in Transformer-Based Text Generation

Authors: Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, William Yang Wang

Abstract: The growth of social media has encouraged the written use of African American Vernacular English (AAVE), which has traditionally been used only in oral contexts. However, NLP models have historically been developed using dominant English varieties, such as Standard American English (SAE), due to text corpora availability. We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs, thereby isolating syntactic structure and AAVE- or SAE-specific language for each pair. We evaluate each sample and its GPT-2 generated text with pretrained sentiment classifiers and find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both. Additionally, we conduct human evaluation of AAVE and SAE text generated with GPT-2 to compare contextual rigor and overall quality.

Submitted 29 October, 2020; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: 7 pages, EMNLP 2020

arXiv:2007.15759 [pdf, other]

The Program with a Personality: Analysis of Elk Cloner, the First Personal Computer Virus

Authors: Scott Levy, Jedidiah R. Crandall

Abstract: Although self-replicating programs and viruses have existed since the 1960s and 70s, Elk Cloner was the first virus to circulate among personal computers in the wild. Despite its historical significance, it received comparatively little attention when it first appeared in 1982. In this paper, we: present the first detailed examination of the operation and structure of Elk Cloner; discuss the effect of environmental characteristics on its virulence; and provide supporting evidence for several hypotheses about why its release was largely ignored in the early 1980s.

Submitted 30 July, 2020; originally announced July 2020.

arXiv:2006.03202 [pdf, other]

Cross-lingual Transfer Learning for COVID-19 Outbreak Alignment

Authors: Sharon Levy, William Yang Wang

Abstract: The spread of COVID-19 has become a significant and troubling aspect of society in 2020. With millions of cases reported across countries, new outbreaks have occurred and followed patterns of previously affected areas. Many disease detection models do not incorporate the wealth of social media data that can be utilized for modeling and predicting its spread. In this case, it is useful to ask, can we utilize this knowledge in one country to model the outbreak in another? To answer this, we propose the task of cross-lingual transfer learning for epidemiological alignment. Utilizing both macro and micro text features, we train on Italy's early COVID-19 outbreak through Twitter and transfer to several other countries. Our experiments show strong results with up to 0.85 Spearman correlation in cross-country predictions.

Submitted 15 October, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

arXiv:2006.00084 [pdf, other]

Clustering-informed Cinematic Astrophysical Data Visualization with Application to the Moon-forming Terrestrial Synestia

Authors: Patrick D. Aleo, Simon J. Lock, Donna J. Cox, Stuart A. Levy, J. P. Naiman, A. J. Christensen, Kalina Borkiewicz, Robert Patterson

Abstract: Scientific visualization tools are currently not optimized to create cinematic, production-quality representations of numerical data for the purpose of science communication. In our pipeline \texttt{Estra}, we outline a step-by-step process from a raw simulation into a finished render as a way to teach non-experts in the field of visualization how to achieve production-quality outputs on their own. We demonstrate feasibility of using the visual effects software Houdini for cinematic astrophysical data visualization, informed by machine learning clustering algorithms. To demonstrate the capabilities of this pipeline, we used a post-impact, thermally-equilibrated Moon-forming synestia from \cite{Lock18}. Our approach aims to identify "physically interpretable" clusters, where clusters identified in an appropriate phase space (e.g. here we use a temperature-entropy phase-space) correspond to physically meaningful structures within the simulation data. Clustering results can then be used to highlight these structures by informing the color-mapping process in a simplified Houdini software shading network, where dissimilar phase-space clusters are mapped to different color values for easier visual identification. Cluster information can also be used in 3D position space, via Houdini's Scene View, to aid in physical cluster finding, simulation prototyping, and data exploration. Our clustering-based renders are compared to those created by the Advanced Visualization Lab (AVL) team for the full dome show "Imagine the Moon" as proof of concept. With \texttt{Estra}, scientists have a tool to create their own production-quality, data-driven visualizations.

Submitted 29 May, 2020; originally announced June 2020.

Comments: 19 pages, 16 figures, submitted to MNRAS

arXiv:2004.13939 [pdf, ps, other]

Evaluating Transformer-Based Multilingual Text Classification

Authors: Sophie Groenwold, Samhita Honnavalli, Lily Ou, Aesha Parekh, Sharon Levy, Diba Mirza, William Yang Wang

Abstract: As NLP tools become ubiquitous in today's technological landscape, they are increasingly applied to languages with a variety of typological structures. However, NLP research does not focus primarily on typological differences in its analysis of state-of-the-art language models. As a result, NLP tools perform unequally across languages with different syntactic and morphological structures. Through a detailed discussion of word order typology, morphological typology, and comparative linguistics, we identify which variables most affect language modeling efficacy; in addition, we calculate word order and morphological similarity indices to aid our empirical study. We then use this background to support our analysis of an experiment we conduct using multi-class text classification on eight languages and eight models.

Submitted 30 April, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: Total of 15 pages (9 pages for paper, 2 pages for references, 4 pages for appendix). Changed title

arXiv:1911.03854 [pdf, other]

r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection

Authors: Kai Nakamura, Sharon Levy, William Yang Wang

Abstract: Fake news has altered society in negative ways in politics and culture. It has adversely affected both online social network systems as well as offline communities and conversations. Using automatic machine learning classification models is an efficient way to combat the widespread dissemination of fake news. However, a lack of effective, comprehensive datasets has been a problem for fake news research and detection model development. Prior fake news datasets do not provide multimodal text and image data, metadata, comment data, and fine-grained fake news categorization at the scale and breadth of our dataset. We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. We construct hybrid text+image models and perform extensive experiments for multiple variations of classification, demonstrating the importance of the novel aspect of multimodality and fine-grained classification unique to Fakeddit.

Submitted 12 March, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

Comments: Accepted LREC 2020

arXiv:1910.07130 [pdf, other]

SCG: Spotting Coordinated Groups in Social Media

Authors: Junhao Wang, Sacha Levy, Ren Wang, Aayushi Kulshrestha, Reihaneh Rabbany

Abstract: Recent events have led to a burgeoning awareness on the misuse of social media sites to affect political events, sway public opinion, and confuse the voters. Such serious, hostile mass manipulation has motivated a large body of works on bots/troll detection and fake news detection, which mostly focus on classifying at the user level based on the content generated by the users. In this study, we jointly analyze the connections among the users, as well as the content generated by them to Spot Coordinated Groups (SCG), sets of users that are likely to be organized towards impacting the general discourse. Given their tiny size (relative to the whole data), detecting these groups is computationally hard. Our proposed method detects these tiny-clusters effectively and efficiently. We deploy our SCG method to summarize and explain the coordinated groups on Twitter around the 2019 Canadian Federal Elections, by analyzing over 60 thousand user accounts with 3.4 million followership connections, and 1.3 million unique hashtags in the content of their tweets. The users in the detected coordinated groups are over 4x more likely to get suspended, whereas the hashtags which characterize their creed are linked to misinformation campaigns.

Submitted 1 September, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

arXiv:1811.01147 [pdf, other]

SafeRoute: Learning to Navigate Streets Safely in an Urban Environment

Authors: Sharon Levy, Wenhan Xiong, Elizabeth Belding, William Yang Wang

Abstract: Recent studies show that 85% of women have changed their traveled route to avoid harassment and assault. Despite this, current mapping tools do not empower users with information to take charge of their personal safety. We propose SafeRoute, a novel solution to the problem of navigating cities and avoiding street harassment and crime. Unlike other street navigation applications, SafeRoute introduces a new type of path generation via deep reinforcement learning. This enables us to successfully optimize for multi-criteria path-finding and incorporate representation learning within our framework. Our agent learns to pick favorable streets to create a safe and short path with a reward function that incorporates safety and efficiency. Given access to recent crime reports in many urban cities, we train our model for experiments in Boston, New York, and San Francisco. We test our model on areas of these cities, specifically the populated downtown regions where tourists and those unfamiliar with the streets walk. We evaluate SafeRoute and successfully improve over state-of-the-art methods by up to 17% in local average distance from crimes while decreasing path length by up to 7%.

Submitted 2 November, 2018; originally announced November 2018.

Comments: 8 pages

arXiv:1611.03056 [pdf, other]

doi 10.1007/978-3-319-24858-5_8

Intrusion Detection System for Applications using Linux Containers

Authors: Amr S. Abed, Charles Clancy, David S. Levy

Abstract: Linux containers are gaining increasing traction in both individual and industrial use, and as these containers get integrated into mission-critical systems, real-time detection of malicious cyber attacks becomes a critical operational requirement. This paper introduces a real-time host-based intrusion detection system that can be used to passively detect malfeasance against applications within Linux containers running in a standalone or in a cloud multi-tenancy environment. The demonstrated intrusion detection system uses bags of system calls monitored from the host kernel for learning the behavior of an application running within a Linux container and determining anomalous container behavior. Performance of the approach using a database application was measured and results are discussed.

Submitted 9 November, 2016; originally announced November 2016.

Comments: The final publication is available at http://link.springer.com/chapter/10.1007%2F978-3-319-24858-5_8. arXiv admin note: substantial text overlap with arXiv:1611.03053

Journal ref: STM 2015. LNCS, vol. 9331, pp. 123-135. Springer, Heidelberg (2015)

arXiv:1611.03053 [pdf, other]

doi 10.1109/GLOCOMW.2015.7414047

Applying Bag of System Calls for Anomalous Behavior Detection of Applications in Linux Containers

Authors: Amr S. Abed, T. Charles Clancy, David S. Levy

Abstract: In this paper, we present the results of using bags of system calls for learning the behavior of Linux containers for use in an anomaly-detection-based intrusion detection system. By using system calls of the containers monitored from the host kernel for anomaly detection, the system does not require any prior knowledge of the container nature, nor does it require altering the container or the host kernel.

Submitted 9 November, 2016; originally announced November 2016.

Comments: Published version available on IEEE Xplore (http://ieeexplore.ieee.org/document/7414047/) arXiv admin note: substantial text overlap with arXiv:1611.03056

Journal ref: 2015 IEEE Globecom Workshops (GC Wkshps), San Diego, CA, 2015, pp. 1-5

Showing 1–36 of 36 results for author: Levy, S