CausalNLP: A Practical Toolkit for Causal Inference with Text

Abstract: Causal inference is the process of estimating the effect or impact of a treatment on an outcome with other covariates as potential confounders (and mediators) that may need to be controlled. The vast majority of existing methods and systems for causal inference assume that all variables under consideration are categorical or numerical (e.g., gender, price, enrollment). In this paper, we present CausalNLP, a toolkit for inferring causality with observational data that includes text in addition to traditional numerical and categorical variables. CausalNLP employs the use of meta learners for treatment effect estimation and supports using raw text and its linguistic properties as a treatment, an outcome, or a "controlled-for" variable (e.g., confounder). The library is open source and available at: https://github.com/amaiya/causalnlp.

Submitted 3 May, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: 9 pages

arXiv:2004.10703 [pdf, ps, other]

ktrain: A Low-Code Library for Augmented Machine Learning

Authors: Arun S. Maiya

Abstract: We present ktrain, a low-code Python library that makes machine learning more accessible and easier to apply. As a wrapper to TensorFlow and many other libraries (e.g., transformers, scikit-learn, stellargraph), it is designed to make sophisticated, state-of-the-art machine learning models simple to build, train, inspect, and apply by both beginners and experienced practitioners. Featuring modules that support text data (e.g., text classification, sequence tagging, open-domain question-answering), vision data (e.g., image classification), graph data (e.g., node classification, link prediction), and tabular data, ktrain presents a simple unified interface enabling one to quickly solve a wide range of tasks in as little as three or four "commands" or lines of code.

Submitted 5 April, 2022; v1 submitted 19 April, 2020; originally announced April 2020.

Comments: 9 pages

arXiv:1508.05902 [pdf, ps, other]

A Framework for Comparing Groups of Documents

Authors: Arun S. Maiya

Abstract: We present a general framework for comparing multiple groups of documents. A bipartite graph model is proposed where document groups are represented as one node set and the comparison criteria are represented as the other node set. Using this model, we present basic algorithms to extract insights into similarities and differences among the document groups. Finally, we demonstrate the versatility of our framework through an analysis of NSF funding programs for basic research.

Submitted 24 August, 2015; originally announced August 2015.

Comments: 6 pages; 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP '15)

ACM Class: I.2.7

arXiv:1505.01072 [pdf, ps, other]

Mining Measured Information from Text

Authors: Arun S. Maiya, Dale Visser, Andrew Wan

Abstract: We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains - especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we present MQSearch: the realization of a search engine with full support for measured information.

Submitted 5 May, 2015; originally announced May 2015.

Comments: 4 pages; 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15)

ACM Class: I.2.7; H.3.3

arXiv:1409.7591 [pdf, ps, other]

Topic Similarity Networks: Visual Analytics for Large Document Sets

Authors: Arun S. Maiya, Robert M. Rolfe

Abstract: We investigate ways in which to improve the interpretability of LDA topic models by better analyzing and visualizing their outputs. We focus on examining what we refer to as topic similarity networks: graphs in which nodes represent latent topics in text collections and links represent similarity among topics. We describe efficient and effective approaches to both building and labeling such networks. Visualizations of topic models based on these networks are shown to be a powerful means of exploring, characterizing, and summarizing large collections of unstructured text documents. They help to "tease out" non-obvious connections among different sets of documents and provide insights into how topics form larger themes. We demonstrate the efficacy and practicality of these approaches through two case studies: 1) NSF grants for basic research spanning a 14 year period and 2) the entire English portion of Wikipedia.

Submitted 26 September, 2014; originally announced September 2014.

Comments: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014)

ACM Class: I.2.6; I.2.7; H.5.2

arXiv:1308.2359 [pdf, ps, other]

Exploratory Analysis of Highly Heterogeneous Document Collections

Authors: Arun S. Maiya, John P. Thompson, Francisco Loaiza-Lemos, Robert M. Rolfe

Abstract: We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Reports and Articles). KERA extracts topic-representative terms from individual documents in a purely unsupervised fashion and is revealed to be significantly more effective than state-of-the-art methods. Finally, we evaluate our system in its ability to help users locate documents pertaining to military critical technologies buried deep in a large heterogeneous sea of information.

Submitted 10 August, 2013; originally announced August 2013.

Comments: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

ACM Class: I.2.7; H.3.3; H.5.2

arXiv:1109.3911 [pdf, ps, other]

Benefits of Bias: Towards Better Characterization of Network Sampling

Authors: Arun S. Maiya, Tanya Y. Berger-Wolf

Abstract: From social networks to P2P systems, network sampling arises in many settings. We present a detailed study on the nature of biases in network sampling strategies to shed light on how best to sample from networks. We investigate connections between specific biases and various measures of structural representativeness. We show that certain biases are, in fact, beneficial for many applications, as they "push" the sampling process towards inclusion of desired properties. Finally, we describe how these sampling biases can be exploited in several, real-world applications including disease outbreak detection and market research.

Submitted 18 September, 2011; originally announced September 2011.

Comments: 9 pages; KDD 2011: 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

ACM Class: H.2.8

arXiv:1009.4383 [pdf, ps, other]

Expansion and Search in Networks

Authors: Arun S. Maiya, Tanya Y. Berger-Wolf

Abstract: Borrowing from concepts in expander graphs, we study the expansion properties of real-world, complex networks (e.g. social networks, unstructured peer-to-peer or P2P networks) and the extent to which these properties can be exploited to understand and address the problem of decentralized search. We first produce samples that concisely capture the overall expansion properties of an entire network, which we collectively refer to as the expansion signature. Using these signatures, we find a correspondence between the magnitude of maximum expansion and the extent to which a network can be efficiently searched. We further find evidence that standard graph-theoretic measures, such as average path length, fail to fully explain the level of "searchability" or ease of information diffusion and dissemination in a network. Finally, we demonstrate that this high expansion can be leveraged to facilitate decentralized search in networks and show that an expansion-based search strategy outperforms typical search methods.

Submitted 1 September, 2011; v1 submitted 22 September, 2010; originally announced September 2010.

Comments: 10 pages

ACM Class: H.2.8; H.3.3

Journal ref: CIKM 2010: 19th ACM International Conference on Information and Knowledge Management

Showing 1–8 of 8 results for author: Maiya, A S