Search | arXiv e-print repository

arXiv:2005.06133

Adaptive Rule Discovery for Labeling Text Data

Authors: Sainyam Galhotra, Behzad Golshan, Wang-Chiew Tan

Abstract: Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines and the emergence of automated feature generation techniques such as deep learning, which typically requires a lot of training data, has further exacerbated the problem. While weak-supervision techniques have circumvented this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel), or require a labeled subset of the data to automatically mine rules (e.g., Snuba). The process of manually writing rules can be tedious and time consuming. At the same time, creating a labeled subset of the data can be costly and even infeasible in imbalanced settings. This is due to the fact that a random sample in imbalanced settings often contains only a few positive instances. To address these shortcomings, we present Darwin, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, Darwin automatically generates a set of candidate rules for the labeling task at hand, and utilizes the annotator's feedback to adapt the candidate rules. We describe how Darwin is scalable and versatile. It can operate over large text corpora (i.e., more than 1 million sentences) and supports a wide range of labeling functions (i.e., any function that can be specified using a context free grammar). Finally, we demonstrate with a suite of experiments over five real-world datasets that Darwin enables annotators to generate weakly-supervised labels efficiently and with a small cost. In fact, our experiments show that rules discovered by Darwin on average identify 40% more positive instances compared to Snuba even when it is provided with 1000 labeled instances.

Submitted 12 May, 2020; originally announced May 2020.

arXiv:2004.14283

SubjQA: A Dataset for Subjectivity and Review Comprehension

Authors: Johannes Bjerva, Nikita Bhutani, Behzad Golshan, Wang-Chiew Tan, Isabelle Augenstein

Abstract: Subjectivity is the expression of internal opinions or beliefs which cannot be objectively observed or verified, and has been shown to be important for sentiment analysis and word-sense disambiguation. Furthermore, subjectivity is an important aspect of user-generated data. In spite of this, subjectivity has not been investigated in contexts where such data is widespread, such as in question answering (QA). We therefore investigate the relationship between subjectivity and QA, while developing a new dataset. We compare and contrast with analyses from previous work, and verify that findings regarding subjectivity still hold when using recently developed NLP architectures. We find that subjectivity is also an important feature in the case of QA, albeit with more intricate interactions between subjectivity and QA performance. For instance, a subjective question may or may not be associated with a subjective answer. We release an English QA dataset (SubjQA) based on customer reviews, containing subjectivity annotations for questions and answer spans across 6 distinct domains.

Submitted 6 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: EMNLP 2020 Long Paper - Camera Ready

arXiv:2004.03020

Enhancing Review Comprehension with Domain-Specific Commonsense

Authors: Aaron Traylor, Chen Chen, Behzad Golshan, Xiaolan Wang, Yuliang Li, Yoshihiko Suhara, Jinfeng Li, Cagatay Demiralp, Wang-Chiew Tan

Abstract: Review comprehension has played an increasingly important role in improving the quality of online services and products and commonsense knowledge can further enhance review comprehension. However, existing general-purpose commonsense knowledge bases lack sufficient coverage and precision to meaningfully improve the comprehension of domain-specific reviews. In this paper, we introduce xSense, an effective system for review comprehension using domain-specific commonsense knowledge bases (xSense KBs). We show that xSense KBs can be constructed inexpensively and present a knowledge distillation method that enables us to use xSense KBs along with BERT to boost the performance of various review comprehension tasks. We evaluate xSense over three review comprehension tasks: aspect extraction, aspect sentiment classification, and question answering. We find that xSense outperforms the state-of-the-art models for the first two tasks and improves the baseline BERT QA model significantly, demonstrating the usefulness of incorporating commonsense into review comprehension pipelines. To facilitate future research and applications, we publicly release three domain-specific knowledge bases and a domain-specific question answering benchmark along with this paper.

Submitted 6 April, 2020; originally announced April 2020.

Comments: 8 pages

arXiv:1910.00637

Essentia: Mining Domain-Specific Paraphrases with Word-Alignment Graphs

Authors: Danni Ma, Chen Chen, Behzad Golshan, Wang-Chiew Tan

Abstract: Paraphrases are important linguistic resources for a wide variety of NLP applications. Many techniques for automatic paraphrase mining from general corpora have been proposed. While these techniques are successful at discovering generic paraphrases, they often fail to identify domain-specific paraphrases (e.g., {staff, concierge} in the hospitality domain). This is because current techniques are often based on statistical methods, while domain-specific corpora are too small to fit statistical methods. In this paper, we present an unsupervised graph-based technique to mine paraphrases from a small set of sentences that roughly share the same topic or intent. Our system, Essentia, relies on word-alignment techniques to create a word-alignment graph that merges and organizes tokens from input sentences. The resulting graph is then used to generate candidate paraphrases. We demonstrate that our system obtains high-quality paraphrases, as evaluated by crowd workers. We further show that the majority of the identified paraphrases are domain-specific and thus complement existing paraphrase databases.

Submitted 4 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

Comments: accepted at the 13th Workshop on Graph-Based Natural Language Processing

arXiv:1909.06731

Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization

Authors: Wataru Hirota, Yoshihiko Suhara, Behzad Golshan, Wang-Chiew Tan

Abstract: We present Emu, a system that semantically enhances multilingual sentence embeddings. Our framework fine-tunes pre-trained multilingual sentence embeddings using two main components: a semantic classifier and a language discriminator. The semantic classifier improves the semantic similarity of related sentences, whereas the language discriminator enhances the multilinguality of the embeddings via multilingual adversarial training. Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.

Submitted 24 November, 2019; v1 submitted 15 September, 2019; originally announced September 2019.

Comments: AAAI 2020

arXiv:1811.05015

doi 10.1016/j.eswa.2018.10.046

A Team-Formation Algorithm for Faultline Minimization

Authors: Sanaz Bahargam, Behzad Golshan, Theodoros Lappas, Evimaria Terzi

Abstract: In recent years, the proliferation of online resumes and the need to evaluate large populations of candidates for on-site and virtual teams have led to a growing interest in automated team-formation. Given a large pool of candidates, the general problem requires the selection of a team of experts to complete a given task. Surprisingly, while ongoing research has studied numerous variations with different constraints, it has overlooked a factor with a well-documented impact on team cohesion and performance: team faultlines. Addressing this gap is challenging, as the available measures for faultlines in existing teams cannot be efficiently applied to faultline optimization. In this work, we meet this challenge with a new measure that can be efficiently used for both faultline measurement and minimization. We then use the measure to solve the problem of automatically partitioning a large population into low-faultline teams. By introducing faultlines to the team-formation literature, our work creates exciting opportunities for algorithmic work on faultline optimization, as well as on work that combines and studies the connection of faultlines with other influential team characteristics.

Submitted 12 November, 2018; originally announced November 2018.

arXiv:1805.01083

Scalable Semantic Querying of Text

Authors: Xiaolan Wang, Aaron Feng, Behzad Golshan, Alon Halevy, George Mihaila, Hidekazu Oiwa, Wang-Chiew Tan

Abstract: We present the KOKO system that takes declarative information extraction to a new level by incorporating advances in natural language processing techniques in its extraction language. KOKO is novel in that its extraction language simultaneously supports conditions on the surface of the text and on the structure of the dependency parse tree of sentences, thereby allowing for more refined extractions. KOKO also supports conditions that are forgiving to linguistic variation of expressing concepts and allows to aggregate evidence from the entire document in order to filter extractions. To scale up, KOKO exploits a multi-indexing scheme and heuristics for efficient extractions. We extensively evaluate KOKO over publicly available text corpora. We show that KOKO indices take up the smallest amount of space, are notably faster and more effective than a number of prior indexing schemes. Finally, we demonstrate KOKO's scale up on a corpus of 5 million Wikipedia articles.

Submitted 2 May, 2018; originally announced May 2018.

arXiv:1801.07746

HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments

Authors: Akari Asai, Sara Evensen, Behzad Golshan, Alon Halevy, Vivian Li, Andrei Lopatenko, Daniela Stepanov, Yoshihiko Suhara, Wang-Chiew Tan, Yinzhan Xu

Abstract: The science of happiness is an area of positive psychology concerned with understanding what behaviors make people happy in a sustainable fashion. Recently, there has been interest in developing technologies that help incorporate the findings of the science of happiness into users' daily lives by steering them towards behaviors that increase happiness. With the goal of building technology that can understand how people express their happy moments in text, we crowd-sourced HappyDB, a corpus of 100,000 happy moments that we make publicly available. This paper describes HappyDB and its properties, and outlines several important NLP problems that can be studied with the help of the corpus. We also apply several state-of-the-art analysis techniques to analyze HappyDB. Our results demonstrate the need for deeper NLP techniques to be developed which makes HappyDB an exciting resource for follow-on research.

Submitted 25 January, 2018; v1 submitted 23 January, 2018; originally announced January 2018.

Comments: Typos fixed

arXiv:1701.05352

Finding low-tension communities

Authors: Esther Galbrun, Behzad Golshan, Aristides Gionis, Evimaria Terzi

Abstract: Motivated by applications that arise in online social media and collaboration networks, there has been a lot of work on community-search and team-formation problems. In the former class of problems, the goal is to find a subgraph that satisfies a certain connectivity requirement and contains a given collection of seed nodes. In the latter class of problems, on the other hand, the goal is to find individuals who collectively have the skills required for a task and form a connected subgraph with certain properties. In this paper, we extend both the community-search and the team-formation problems by associating each individual with a profile. The profile is a numeric score that quantifies the position of an individual with respect to a topic. We adopt a model where each individual starts with a latent profile and arrives to a conformed profile through a dynamic conformation process, which takes into account the individual's social interaction and the tendency to conform with one's social environment. In this framework, social tension arises from the differences between the conformed profiles of neighboring individuals as well as from differences between individuals' conformed and latent profiles. Given a network of individuals, their latent profiles and this conformation process, we extend the community-search and the team-formation problems by requiring the output subgraphs to have low social tension. From the technical point of view, we study the complexity of these problems and propose algorithms for solving them effectively. Our experimental evaluation in a number of social networks reveals the efficacy and efficiency of our methods.

Submitted 19 January, 2017; originally announced January 2017.

Comments: A short version of this paper appeared in the 2017 SIAM International Conference on Data Mining, SDM'17. In this extended version, we discuss the team-formation problem variant, beside the original community-search problem, and include additional experimental results

Showing 1–9 of 9 results for author: Golshan, B