Search | arXiv e-print repository

Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

Authors: Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Dean Stutzman, Bismarck Bamfo Odoom, Sanjeev Khudanpur, Stephen D. Richardson, Kenton Murray

Abstract: A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We pr… ▽ More A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions. △ Less

Submitted 13 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: NAACL 2024

arXiv:2402.09286 [pdf]

doi 10.1093/jamia/ocae102

Nutrition Facts, Drug Facts, and Model Facts: Putting AI Ethics into Practice in Gun Violence Research

Authors: Jessica Zhu, Dr. Michel Cukier, Dr. Joseph Richardson Jr

Abstract: Objective: Firearm injury research necessitates using data from often-exploited vulnerable populations of Black and Brown Americans. In order to minimize distrust, this study provides a framework for establishing AI trust and transparency with the general population. Methods: We propose a Model Facts template that is easily extendable and decomposes accuracy and demographics into standardized and… ▽ More Objective: Firearm injury research necessitates using data from often-exploited vulnerable populations of Black and Brown Americans. In order to minimize distrust, this study provides a framework for establishing AI trust and transparency with the general population. Methods: We propose a Model Facts template that is easily extendable and decomposes accuracy and demographics into standardized and minimally complex values. This framework allows general users to assess the validity and biases of a model without diving into technical model documentation. Examples: We apply the Model Facts template on two previously published models, a violence risk identification model and a suicide risk prediction model. We demonstrate the ease of accessing the appropriate information when the data is structured appropriately. Discussion: The Model Facts template is limited in its current form to human based data and biases. Like nutrition facts, it also will require some educational resources for users to grasp its full utility. Human computer interaction experiments should be conducted to ensure that the interaction between user interface and model interface is as desired. Conclusion: The Model Facts label is the first framework dedicated to establishing trust with end users and general population consumers. Implementation of Model Facts into firearm injury research will provide public health practitioners and those impacted by firearm injury greater faith in the tools the research provides. △ Less

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2311.03425 [pdf]

An AI-Guided Data Centric Strategy to Detect and Mitigate Biases in Healthcare Datasets

Authors: Faris F. Gulamali, Ashwin S. Sawant, Lora Liharska, Carol R. Horowitz, Lili Chan, Patricia H. Kovatch, Ira Hofer, Karandeep Singh, Lynne D. Richardson, Emmanuel Mensah, Alexander W Charney, David L. Reich, Jianying Hu, Girish N. Nadkarni

Abstract: The adoption of diagnosis and prognostic algorithms in healthcare has led to concerns about the perpetuation of bias against disadvantaged groups of individuals. Deep learning methods to detect and mitigate bias have revolved around modifying models, optimization strategies, and threshold calibration with varying levels of success. Here, we generate a data-centric, model-agnostic, task-agnostic ap… ▽ More The adoption of diagnosis and prognostic algorithms in healthcare has led to concerns about the perpetuation of bias against disadvantaged groups of individuals. Deep learning methods to detect and mitigate bias have revolved around modifying models, optimization strategies, and threshold calibration with varying levels of success. Here, we generate a data-centric, model-agnostic, task-agnostic approach to evaluate dataset bias by investigating the relationship between how easily different groups are learned at small sample sizes (AEquity). We then apply a systematic analysis of AEq values across subpopulations to identify and mitigate manifestations of racial bias in two known cases in healthcare - Chest X-rays diagnosis with deep convolutional neural networks and healthcare utilization prediction with multivariate logistic regression. AEq is a novel and broadly applicable metric that can be applied to advance equity by diagnosing and remediating bias in healthcare datasets. △ Less

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.14809 [pdf, other]

Learning spatio-temporal patterns with Neural Cellular Automata

Authors: Alex D. Richardson, Tibor Antal, Richard A. Blythe, Linus J. Schumacher

Abstract: Neural Cellular Automata (NCA) are a powerful combination of machine learning and mechanistic modelling. We train NCA to learn complex dynamics from time series of images and PDE trajectories. Our method is designed to identify underlying local rules that govern large scale dynamic emergent behaviours. Previous work on NCA focuses on learning rules that give stationary emergent structures. We exte… ▽ More Neural Cellular Automata (NCA) are a powerful combination of machine learning and mechanistic modelling. We train NCA to learn complex dynamics from time series of images and PDE trajectories. Our method is designed to identify underlying local rules that govern large scale dynamic emergent behaviours. Previous work on NCA focuses on learning rules that give stationary emergent structures. We extend NCA to capture both transient and stable structures within the same system, as well as learning rules that capture the dynamics of Turing pattern formation in nonlinear Partial Differential Equations (PDEs). We demonstrate that NCA can generalise very well beyond their PDE training data, we show how to constrain NCA to respect given symmetries, and we explore the effects of associated hyperparameters on model performance and stability. Being able to learn arbitrary dynamics gives NCA great potential as a data driven modelling framework, especially for modelling biological pattern formation. △ Less

Submitted 22 April, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: For videos referenced in appendix, see: https://github.com/AlexDR1998/NCA/tree/main/Videos

arXiv:2303.15656 [pdf, other]

Predicting Adverse Neonatal Outcomes for Preterm Neonates with Multi-Task Learning

Authors: **gyang Lin, Junyu Chen, Hanjia Lyu, Igor Khodak, Divya Chhabra, Colby L Day Richardson, Irina Prelipcean, Andrew M Dylag, Jiebo Luo

Abstract: Diagnosis of adverse neonatal outcomes is crucial for preterm survival since it enables doctors to provide timely treatment. Machine learning (ML) algorithms have been demonstrated to be effective in predicting adverse neonatal outcomes. However, most previous ML-based methods have only focused on predicting a single outcome, ignoring the potential correlations between different outcomes, and pote… ▽ More Diagnosis of adverse neonatal outcomes is crucial for preterm survival since it enables doctors to provide timely treatment. Machine learning (ML) algorithms have been demonstrated to be effective in predicting adverse neonatal outcomes. However, most previous ML-based methods have only focused on predicting a single outcome, ignoring the potential correlations between different outcomes, and potentially leading to suboptimal results and overfitting issues. In this work, we first analyze the correlations between three adverse neonatal outcomes and then formulate the diagnosis of multiple neonatal outcomes as a multi-task learning (MTL) problem. We then propose an MTL framework to jointly predict multiple adverse neonatal outcomes. In particular, the MTL framework contains shared hidden layers and multiple task-specific branches. Extensive experiments have been conducted using Electronic Health Records (EHRs) from 121 preterm neonates. Empirical results demonstrate the effectiveness of the MTL framework. Furthermore, the feature importance is analyzed for each neonatal outcome, providing insights into model interpretability. △ Less

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2208.03900 [pdf, other]

doi 10.1109/TVCG.2022.3209495

The Influence of Visual Provenance Representations on Strategies in a Collaborative Hand-off Data Analysis Scenario

Authors: Jeremy E. Block, Shaghayegh Esmaeili, Eric D. Ragan, John R. Goodall, G. David Richardson

Abstract: Conducting data analysis tasks rarely occur in isolation. Especially in intelligence analysis scenarios where different experts contribute knowledge to a shared understanding, members must communicate how insights develop to establish common ground among collaborators. The use of provenance to communicate analytic sensemaking carries promise by describing the interactions and summarizing the steps… ▽ More Conducting data analysis tasks rarely occur in isolation. Especially in intelligence analysis scenarios where different experts contribute knowledge to a shared understanding, members must communicate how insights develop to establish common ground among collaborators. The use of provenance to communicate analytic sensemaking carries promise by describing the interactions and summarizing the steps taken to reach insights. Yet, no universal guidelines exist for communicating provenance in different settings. Our work focuses on the presentation of provenance information and the resulting conclusions reached and strategies used by new analysts. In an open-ended, 30-minute, textual exploration scenario, we qualitatively compare how adding different types of provenance information (specifically data coverage and interaction history) affects analysts' confidence in conclusions developed, propensity to repeat work, filtering of data, identification of relevant information, and typical investigation strategies. We see that data coverage (i.e., what was interacted with) provides provenance information without limiting individual investigation freedom. On the other hand, while interaction history (i.e., when something was interacted with) does not significantly encourage more mimicry, it does take more time to comfortably understand, as represented by less confident conclusions and less relevant information-gathering behaviors. Our results contribute empirical data towards understanding how provenance summarizations can influence analysis behaviors. △ Less

Submitted 7 August, 2022; originally announced August 2022.

Comments: to be published in IEEE Vis 2022

Journal ref: IEEE Transactions on Visualization and Computer Graphics 2022

arXiv:2008.00017 [pdf]

Safety, Security, and Privacy Threats Posed by Accelerating Trends in the Internet of Things

Authors: Kevin Fu, Tadayoshi Kohno, Daniel Lopresti, Elizabeth Mynatt, Klara Nahrstedt, Shwetak Patel, Debra Richardson, Ben Zorn

Abstract: The Internet of Things (IoT) is already transforming industries, cities, and homes. The economic value of this transformation across all industries is estimated to be trillions of dollars and the societal impact on energy efficiency, health, and productivity are enormous. Alongside potential benefits of interconnected smart devices comes increased risk and potential for abuse when embedding sensin… ▽ More The Internet of Things (IoT) is already transforming industries, cities, and homes. The economic value of this transformation across all industries is estimated to be trillions of dollars and the societal impact on energy efficiency, health, and productivity are enormous. Alongside potential benefits of interconnected smart devices comes increased risk and potential for abuse when embedding sensing and intelligence into every device. One of the core problems with the increasing number of IoT devices is the increased complexity that is required to operate them safely and securely. This increased complexity creates new safety, security, privacy, and usability challenges far beyond the difficult challenges individuals face just securing a single device. We highlight some of the negative trends that smart devices and collections of devices cause and we argue that issues related to security, physical safety, privacy, and usability are tightly interconnected and solutions that address all four simultaneously are needed. Tight safety and security standards for individual devices based on existing technology are needed. Likewise research that determines the best way for individuals to confidently manage collections of devices must guide the future deployments of such systems. △ Less

Submitted 31 July, 2020; originally announced August 2020.

Comments: A Computing Community Consortium (CCC) white paper, 9 pages

arXiv:2007.03704 [pdf]

Computer-Aided Personalized Education

Authors: Rajeev Alur, Richard Baraniuk, Rastislav Bodik, Ann Drobnis, Sumit Gulwani, Bjoern Hartmann, Yasmin Kafai, Jeff Karpicke, Ran Libeskind-Hadas, Debra Richardson, Armando Solar-Lezama, Candace Thille, Moshe Vardi

Abstract: The shortage of people trained in STEM fields is becoming acute, and universities and colleges are straining to satisfy this demand. In the case of computer science, for instance, the number of US students taking introductory courses has grown three-fold in the past decade. Recently, massive open online courses (MOOCs) have been promoted as a way to ease this strain. This at best provides access t… ▽ More The shortage of people trained in STEM fields is becoming acute, and universities and colleges are straining to satisfy this demand. In the case of computer science, for instance, the number of US students taking introductory courses has grown three-fold in the past decade. Recently, massive open online courses (MOOCs) have been promoted as a way to ease this strain. This at best provides access to education. The bigger challenge though is co** with heterogeneous backgrounds of different students, retention, providing feedback, and assessment. Personalized education relying on computational tools can address this challenge. While automated tutoring has been studied at different times in different communities, recent advances in computing and education technology offer exciting opportunities to transform the manner in which students learn. In particular, at least three trends are significant. First, progress in logical reasoning, data analytics, and natural language processing has led to tutoring tools for automatic assessment, personalized instruction including targeted feedback, and adaptive content generation for a variety of subjects. Second, research in the science of learning and human-computer interaction is leading to a better understanding of how different students learn, when and what types of interventions are effective for different instructional goals, and how to measure the success of educational tools. Finally, the recent emergence of online education platforms, both in academia and industry, is leading to new opportunities for the development of a shared infrastructure. This CCC workshop brought together researchers develo** educational tools based on technologies such as logical reasoning and machine learning with researchers in education, human-computer interaction, and cognitive psychology. △ Less

Submitted 7 July, 2020; originally announced July 2020.

Comments: A Computing Community Consortium (CCC) workshop report, 12 pages

Report number: ccc2016report_6

arXiv:1811.03482 [pdf, ps, other]

A holistic look at requirements engineering practices in the gaming industry

Authors: Aftab Hussain, Omar Asadi, Debra J. Richardson

Abstract: In this work we present an account of the status of requirements engineering in the gaming industry. Recent papers in the area were surveyed. Characterizations of the gaming industry were deliberated upon by portraying its relations with the market industry. Some research directions in the area of requirements engineering in the gaming industry were also mentioned. In this work we present an account of the status of requirements engineering in the gaming industry. Recent papers in the area were surveyed. Characterizations of the gaming industry were deliberated upon by portraying its relations with the market industry. Some research directions in the area of requirements engineering in the gaming industry were also mentioned. △ Less

Submitted 8 November, 2018; originally announced November 2018.

arXiv:1808.10742 [pdf, other]

Anomaly Detection in Cyber Network Data Using a Cyber Language Approach

Authors: Bartley D. Richardson, Benjamin J. Radford, Shawn E. Davis, Keegan Hines, David Pekarek

Abstract: As the amount of cyber data continues to grow, cyber network defenders are faced with increasing amounts of data they must analyze to ensure the security of their networks. In addition, new types of attacks are constantly being created and executed globally. Current rules-based approaches are effective at characterizing and flagging known attacks, but they typically fail when presented with a new… ▽ More As the amount of cyber data continues to grow, cyber network defenders are faced with increasing amounts of data they must analyze to ensure the security of their networks. In addition, new types of attacks are constantly being created and executed globally. Current rules-based approaches are effective at characterizing and flagging known attacks, but they typically fail when presented with a new attack or new types of data. By comparison, unsupervised machine learning offers distinct advantages by not requiring labeled data to learn from large amounts of network traffic. In this paper, we present a natural language-based technique (suffix trees) as applied to cyber anomaly detection. We illustrate one methodology to generate a language using cyber data features, and our experimental results illustrate positive preliminary results in applying this technique to flow-type data. As an underlying assumption to this work, we make the claim that malicious cyber actors leave observables in the data as they execute their attacks. This work seeks to identify those artifacts and exploit them to identify a wide range of cyber attacks without the need for labeled ground-truth data. △ Less

Submitted 15 August, 2018; originally announced August 2018.

arXiv:1805.03735 [pdf, other]

Sequence Aggregation Rules for Anomaly Detection in Computer Network Traffic

Authors: Benjamin J. Radford, Bartley D. Richardson, Shawn E. Davis

Abstract: We evaluate methods for applying unsupervised anomaly detection to cybersecurity applications on computer network traffic data, or flow. We borrow from the natural language processing literature and conceptualize flow as a sort of "language" spoken between machines. Five sequence aggregation rules are evaluated for their efficacy in flagging multiple attack types in a labeled flow dataset, CICIDS2… ▽ More We evaluate methods for applying unsupervised anomaly detection to cybersecurity applications on computer network traffic data, or flow. We borrow from the natural language processing literature and conceptualize flow as a sort of "language" spoken between machines. Five sequence aggregation rules are evaluated for their efficacy in flagging multiple attack types in a labeled flow dataset, CICIDS2017. For sequence modeling, we rely on long short-term memory (LSTM) recurrent neural networks (RNN). Additionally, a simple frequency-based model is described and its performance with respect to attack detection is compared to the LSTM models. We conclude that the frequency-based model tends to perform as well as or better than the LSTM models for the tasks at hand, with a few notable exceptions. △ Less

Submitted 14 May, 2018; v1 submitted 9 May, 2018; originally announced May 2018.

Comments: Prepared for the American Statistical Associations Symposium on Data Science and Statistics 2018

arXiv:1705.01993 [pdf]

Intelligent Infrastructure for Smart Agriculture: An Integrated Food, Energy and Water System

Authors: Shashi Shekhar, Joe Colletti, Francisco Muñoz-Arriola, Lakshmish Ramaswamy, Chandra Krintz, Lav Varshney, Debra Richardson

Abstract: Agriculture provides economic opportunity through innovation; helps rural America to thrive; promotes agricultural production that better nourishes Americans; and aims to preserve natural resources through healthy private working lands, conservation, improved watersheds, and restored forests. From agricultural production to food supply, agriculture supports rural and urban economies across the U.S… ▽ More Agriculture provides economic opportunity through innovation; helps rural America to thrive; promotes agricultural production that better nourishes Americans; and aims to preserve natural resources through healthy private working lands, conservation, improved watersheds, and restored forests. From agricultural production to food supply, agriculture supports rural and urban economies across the U.S. It accounts for 10% of U.S. jobs and is currently creating new jobs in the growing field of data-driven farming. However, U.S. global competitiveness associated with food and nutrition security is at risk because of accelerated investments by many other countries in agriculture, food, energy, and resource management. To ensure U.S. global competitiveness and long-term food security, it is imperative that we build sustainable physical and cyber infrastructures to enable self-managing and sustainable farming. Such infrastructures should enable next generation precision-farms by harnessing modern and emerging technologies such as small satellites, broadband Internet, tele-operation, augmented reality, advanced data analytics, sensors, and robotics. △ Less

Submitted 4 May, 2017; originally announced May 2017.

Comments: A Computing Community Consortium (CCC) white paper, 8 pages

Showing 1–12 of 12 results for author: Richardson, D