
Showing 1–48 of 48 results for author: Caragea, C

  1. arXiv:2407.05183 [pdf, other]

    cs.CV cs.AI

    FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding

    Authors: Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki

    Abstract: Flowcharts are graphical tools for representing complex concepts in concise visual representations. This paper introduces the FlowLearn dataset, a resource tailored to enhance the understanding of flowcharts. FlowLearn contains complex scientific flowcharts and simulated flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature and the simulated subset contains…

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: ECAI 2024

  2. arXiv:2406.14756 [pdf, other]

    cs.AI

    SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

    Authors: Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki

    Abstract: We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated me…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: LREC/COLING 2024

    MSC Class: I.2.7

    Journal ref: LREC-COLING. (2024) 14407-14417

  3. arXiv:2406.14666 [pdf, other]

    cs.CL

    Co-training for Low Resource Scientific Natural Language Inference

    Authors: Mobashir Sadat, Cornelia Caragea

    Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. The automatic annotation method based on distant supervision for the training set of SciNLI (Sadat and Caragea, 2022b), the first and most popular dataset for this task, results in label noise which inevitably degenerates the performance of class…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted in ACL 2024 (main conference)

  4. arXiv:2405.11877 [pdf, other]

    cs.CL cs.AI cs.LG

    A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus

    Authors: Eduard Poesina, Cornelia Caragea, Radu Tudor Ionescu

    Abstract: Natural language inference (NLI), the task of recognizing the entailment relationship in sentence pairs, is an actively studied topic serving as a proxy for natural language understanding. Despite the relevance of the task in building conversational agents and improving text classification, machine translation and other NLP tasks, to the best of our knowledge, there is no publicly available NLI co…

    Submitted 22 May, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

    Comments: Accepted at ACL 2024 (Main)

  5. arXiv:2404.15592 [pdf, other]

    cs.CV cs.AI cs.CL cs.IR cs.LG

    ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction

    Authors: Henry Peng Zou, Vinay Samuel, Yue Zhou, Weizhi Zhang, Liancheng Fang, Zihe Song, Philip S. Yu, Cornelia Caragea

    Abstract: Existing datasets for attribute value extraction (AVE) predominantly focus on explicit attribute values while neglecting the implicit ones, lack product images, are often not publicly available, and lack an in-depth human inspection across diverse domains. To address these limitations, we present ImplicitAVE, the first, publicly available multimodal dataset for implicit attribute value extraction.…

    Submitted 23 April, 2024; originally announced April 2024.

  6. arXiv:2404.08886 [pdf, other]

    cs.CV cs.AI cs.CL cs.IR cs.LG

    EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM

    Authors: Henry Peng Zou, Gavin Heqing Yu, Ziwei Fan, Dan Bu, Han Liu, Peng Dai, Dongmei Jia, Cornelia Caragea

    Abstract: In e-commerce, accurately extracting product attribute values from multimodal data is crucial for improving user experience and operational efficiency of retailers. However, previous approaches to multimodal attribute value extraction often struggle with implicit attribute values embedded in images or text, rely heavily on extensive labeled data, and can easily confuse similar attribute values. To…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted by NAACL 2024 Industry Track

  7. arXiv:2404.08066 [pdf, other]

    cs.CL

    MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference

    Authors: Mobashir Sadat, Cornelia Caragea

    Abstract: The task of scientific Natural Language Inference (NLI) involves predicting the semantic relation between two sentences extracted from research articles. This task was recently proposed along with a new dataset called SciNLI derived from papers published in the computational linguistics domain. In this paper, we aim to introduce diversity in the scientific NLI task and present MSciNLI, a dataset c…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Accepted to the NAACL 2024 Main Conference

  8. arXiv:2402.00976 [pdf, ps, other]

    cs.LG cs.AI cs.NE

    Investigating Recurrent Transformers with Dynamic Halt

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: In this paper, we study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism - (1) the approach of incorporating a depth-wise recurrence similar to Universal Transformers; and (2) the approach of incorporating a chunk-wise temporal recurrence like Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine the a…

    Submitted 31 March, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  9. arXiv:2311.09602 [pdf, other]

    cs.CL

    Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion

    Authors: Smriti Singh, Cornelia Caragea, Junyi Jessy Li

    Abstract: Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; t…

    Submitted 25 March, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Camera Ready

  10. arXiv:2311.04449 [pdf, other]

    cs.LG cs.CL

    Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: Binary Balanced Tree RvNNs (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-R…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: Accepted at NeurIPS 2023

  11. arXiv:2310.14627 [pdf, other]

    cs.CL cs.LG

    CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster Tweet Classification

    Authors: Henry Peng Zou, Yue Zhou, Cornelia Caragea, Doina Caragea

    Abstract: The shared real-time information about natural disasters on social media platforms like Twitter and Facebook plays a critical role in informing volunteers, emergency managers, and response organizations. However, supervised learning models for monitoring disaster events require large amounts of annotated data, making them unrealistic for real-time use in disaster events. To address this challenge,…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted by ISCRAM 2023

  12. arXiv:2310.14583 [pdf, other]

    cs.CL cs.LG

    JointMatch: A Unified Approach for Diverse and Collaborative Pseudo-Labeling to Semi-Supervised Text Classification

    Authors: Henry Peng Zou, Cornelia Caragea

    Abstract: Semi-supervised text classification (SSTC) has gained increasing attention due to its ability to leverage unlabeled data. However, existing approaches based on pseudo-labeling suffer from the issues of pseudo-label bias and error accumulation. In this paper, we propose JointMatch, a holistic approach for SSTC that addresses these challenges by unifying ideas from recent semi-supervised learning an…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023 (Main)

  13. arXiv:2310.14577 [pdf, other]

    cs.CL cs.LG

    DeCrisisMB: Debiased Semi-Supervised Learning for Crisis Tweet Classification via Memory Bank

    Authors: Henry Peng Zou, Yue Zhou, Weizhi Zhang, Cornelia Caragea

    Abstract: During crisis events, people often use social media platforms such as Twitter to disseminate information about the situation, warnings, advice, and support. Emergency relief organizations leverage such information to acquire timely crisis circumstances and expedite rescue operations. While existing works utilize such information to build models for crisis event analysis, fully-supervised approache…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023 (Findings)

  14. arXiv:2308.09037 [pdf, other]

    cs.CV

    MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins

    Authors: Tiberiu Sosea, Cornelia Caragea

    Abstract: We introduce MarginMatch, a new SSL approach combining consistency regularization and pseudo-labeling, with its main novelty arising from the use of unlabeled data training dynamics to measure pseudo-label quality. Instead of using only the model's confidence on an unlabeled example at an arbitrary iteration to decide if the example should be masked or not, MarginMatch also analyzes the behavior o…

    Submitted 17 August, 2023; originally announced August 2023.

  15. arXiv:2308.08156 [pdf, other]

    cs.CL cs.LG

    Sarcasm Detection in a Disaster Context

    Authors: Tiberiu Sosea, Junyi Jessy Li, Cornelia Caragea

    Abstract: During natural disasters, people often use social media platforms such as Twitter to ask for help, to provide information about the disaster situation, or to express contempt about the unfolding event or public policies and guidelines. This contempt is in some cases expressed as sarcasm or irony. Understanding this form of speech in a disaster-centric context is essential to improving natural lang…

    Submitted 16 August, 2023; originally announced August 2023.

  16. arXiv:2307.10779 [pdf, other]

    cs.LG

    Efficient Beam Tree Recursion

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: Beam Tree Recursive Neural Network (BT-RvNN) was recently proposed as a simple extension of Gumbel Tree RvNN and it was shown to achieve state-of-the-art length generalization performance in ListOps while maintaining comparable performance on other tasks. However, although not the worst in its kind, BT-RvNN can be still exorbitantly expensive in memory usage. In this paper, we identify the main bo…

    Submitted 7 November, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted in NeurIPS 2023

  17. arXiv:2306.01444 [pdf, other]

    cs.CL

    Unsupervised Extractive Summarization of Emotion Triggers

    Authors: Tiberiu Sosea, Hongli Zhan, Junyi Jessy Li, Cornelia Caragea

    Abstract: Understanding what leads to emotions during large-scale crises is important as it can provide groundings for expressed emotions and subsequently improve the understanding of ongoing disasters. Recent approaches trained supervised models to both detect emotions and explain emotion triggers (events and appraisals) via abstractive summarization. However, obtaining timely and qualitative abstractive s…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: ACL 2023 Camera-Ready

  18. arXiv:2305.20019 [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Monotonic Location Attention for Length Generalization

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: We explore different ways to utilize position-based cross-attention in seq2seq networks to enable length generalization in algorithmic tasks. We show that a simple approach of interpolating the original and reversed encoded representations combined with relative attention allows near-perfect length generalization for both forward and reverse lookup tasks or copy tasks that had been generally hard…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted in ICML 2023

  19. arXiv:2305.19999 [pdf, other]

    cs.LG cs.AI cs.CL

    Beam Tree Recursive Cells

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: We propose Beam Tree Recursive Cell (BT-Cell) - a backpropagation-friendly framework to extend Recursive Neural Networks (RvNNs) with beam search for latent structure induction. We further extend this framework by proposing a relaxation of the hard top-k operators in beam search for better propagation of gradient signals. We evaluate our proposed models in different out-of-distribution splits in b…

    Submitted 20 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted in ICML 2023

  20. arXiv:2305.17968 [pdf, other]

    cs.CL

    Data Augmentation for Low-Resource Keyphrase Generation

    Authors: Krishna Garg, Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: Keyphrase generation is the task of summarizing the contents of any given article into a few salient phrases (or keyphrases). Existing works for the task mostly rely on large-scale annotated datasets, which are not easy to acquire. Very few works address the problem of keyphrase generation in low-resource settings, but they still rely on a lot of additional unlabeled data for pretraining and on au…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 9 pages, 8 tables, To appear at the Findings of the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada

  21. DMDD: A Large-Scale Dataset for Dataset Mentions Detection

    Authors: Huitong Pan, Qi Zhang, Eduard Dragut, Cornelia Caragea, Longin Jan Latecki

    Abstract: The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus fo…

    Submitted 19 May, 2023; originally announced May 2023.

    Comments: Pre-MIT Press publication version. Submitted to TACL

    ACM Class: I.2.7

    Journal ref: Transactions of the Association for Computational Linguistics. 11 (2023) 1132-1146

  22. arXiv:2304.13883 [pdf, other]

    cs.CL cs.IR

    Neural Keyphrase Generation: Analysis and Evaluation

    Authors: Tuhin Kundu, Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: Keyphrase generation aims at generating topical phrases from a given text either by copying from the original text (present keyphrases) or by producing new keyphrases (absent keyphrases) that capture the semantic meaning of the text. Encoder-decoder models are most widely used for this task because of their capabilities for absent keyphrase generation. However, there has been little to no analysis…

    Submitted 26 April, 2023; originally announced April 2023.

  23. arXiv:2304.12404 [pdf, other]

    cs.CL

    Semantic Tokenizer for Enhanced Natural Language Processing

    Authors: Sandeep Mehta, Darpan Shah, Ravindra Kulkarni, Cornelia Caragea

    Abstract: Traditionally, NLP performance improvement has been focused on improving models and increasing the number of model parameters. NLP vocabulary construction has remained focused on maximizing the number of words represented through subword regularization. We present a novel tokenizer that uses semantics to drive vocabulary construction. The tokenizer includes a trainer that uses stemming to enhance…

    Submitted 24 April, 2023; originally announced April 2023.

  24. arXiv:2211.02971 [pdf, other]

    cs.CL

    Learning to Infer from Unlabeled Data: A Semi-supervised Learning Approach for Robust Natural Language Inference

    Authors: Mobashir Sadat, Cornelia Caragea

    Abstract: Natural Language Inference (NLI) or Recognizing Textual Entailment (RTE) aims at predicting the relation between a pair of sentences (premise and hypothesis) as entailment, contradiction or semantic independence. Although deep learning models have shown promising performance for NLI in recent years, they rely on large scale expensive human-annotated datasets. Semi-supervised learning (SSL) is a po…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: Accepted in EMNLP 2022 (Findings)

  25. arXiv:2211.02810 [pdf, other]

    cs.CL

    Hierarchical Multi-Label Classification of Scientific Documents

    Authors: Mobashir Sadat, Cornelia Caragea

    Abstract: Automatic topic classification has been studied extensively to assist managing and indexing scientific documents in a digital collection. With the large number of topics being available in recent years, it has become necessary to arrange them in a hierarchy. Therefore, the automatic classification systems need to be able to classify the documents hierarchically. In addition, each paper is often as…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: Accepted in EMNLP 2022 main conference

  26. arXiv:2210.12531 [pdf, other]

    cs.CL cs.SI

    Why Do You Feel This Way? Summarizing Triggers of Emotions in Social Media Posts

    Authors: Hongli Zhan, Tiberiu Sosea, Cornelia Caragea, Junyi Jessy Li

    Abstract: Crises such as the COVID-19 pandemic continuously threaten our world and emotionally affect billions of people worldwide in distinct ways. Understanding the triggers leading to people's emotions is of crucial importance. Social media posts can be a good source of such analysis, yet these texts tend to be charged with multiple emotions, with triggers scattering across multiple sentences. This paper…

    Submitted 22 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 Camera Ready Version

    Journal ref: https://aclanthology.org/2022.emnlp-main.642/

  27. arXiv:2205.03403 [pdf, other]

    cs.CL

    A Data Cartography based MixUp for Pre-trained Language Models

    Authors: Seo Yeon Park, Cornelia Caragea

    Abstract: MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels. However, selecting random pairs is not potentially an optimal choice. In this work, we propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples.…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted at NAACL 2022 main conference. arXiv admin note: text overlap with arXiv:2203.07559

  28. arXiv:2203.07559 [pdf, other]

    cs.CL cs.LG

    On the Calibration of Pre-trained Language Models using Mixup Guided by Area Under the Margin and Saliency

    Authors: Seo Yeon Park, Cornelia Caragea

    Abstract: A well-calibrated neural model produces confidence (probability outputs) closely approximated by the expected accuracy. While prior studies have shown that mixup training as a data augmentation technique can improve model calibration on image classification tasks, little is known about using mixup for model calibration on natural language understanding (NLU) tasks. In this paper, we explore mixup…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022 main conference

  29. arXiv:2203.06728 [pdf, other]

    cs.CL

    SciNLI: A Corpus for Natural Language Inference on Scientific Text

    Authors: Mobashir Sadat, Cornelia Caragea

    Abstract: Existing Natural Language Inference (NLI) datasets, while being instrumental in the advancement of Natural Language Understanding (NLU) research, are not related to scientific text. In this paper, we introduce SciNLI, a large dataset for NLI that captures the formality in scientific text and contains 107,412 sentence pairs extracted from scholarly papers on NLP and computational linguistics. Given…

    Submitted 14 March, 2022; v1 submitted 13 March, 2022; originally announced March 2022.

  30. arXiv:2203.04464 [pdf, other]

    cs.CL

    On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation

    Authors: Jishnu Ray Chowdhury, Debanjan Mahata, Cornelia Caragea

    Abstract: We study the task of predicting a set of salient questions from a given paragraph without any prior knowledge of the precise answer. We make two main contributions. First, we propose a new method to evaluate a set of predicted questions against the set of references by using the Hungarian algorithm to assign predicted questions to references before scoring the assigned pairs. We show that our prop…

    Submitted 11 March, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

  31. arXiv:2112.06776 [pdf, other]

    cs.CL

    Keyphrase Generation Beyond the Boundaries of Title and Abstract

    Authors: Krishna Garg, Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: Keyphrase generation aims at generating important phrases (keyphrases) that best describe a given document. In scholarly domains, current approaches have largely used only the title and abstract of the articles to generate keyphrases. In this paper, we comprehensively explore whether the integration of additional information from the full text of a given article or from semantically similar articl…

    Submitted 20 October, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

    Comments: 9 pages, 1 figure, 7 tables

  32. arXiv:2112.01476 [pdf, other]

    cs.CL

    KPDrop: Improving Absent Keyphrase Generation

    Authors: Jishnu Ray Chowdhury, Seoyeon Park, Tuhin Kundu, Cornelia Caragea

    Abstract: Keyphrase generation is the task of generating phrases (keyphrases) that summarize the main topics of a given document. Keyphrases can be either present or absent from the given document. While the extraction of present keyphrases has received much attention in the past, only recently a stronger focus has been placed on the generation of absent keyphrases. However, generating absent keyphrases is…

    Submitted 24 October, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: Accepted in EMNLP Findings 2022

  33. arXiv:2109.14059 [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Generating Summaries for Scientific Paper Review

    Authors: Ana Sabina Uban, Cornelia Caragea

    Abstract: The review process is essential to ensure the quality of publications. Recently, the increase of submissions for top venues in machine learning and NLP has caused a problem of excessive burden on reviewers and has often caused concerns regarding how this may not only overload reviewers, but also may affect the quality of the reviews. An automatic system for assisting with the reviewing process cou…

    Submitted 28 September, 2021; originally announced September 2021.

  34. arXiv:2109.03383 [pdf, ps, other]

    cs.CL cs.AI

    DeepZensols: Deep Natural Language Processing Framework

    Authors: Paul Landes, Barbara Di Eugenio, Cornelia Caragea

    Abstract: Reproducing results in publications by distributing publicly available source code is becoming ever more popular. Given the difficulty of reproducing machine learning (ML) experiments, there have been significant efforts in reducing the variance of these results. As in any science, the ability to consistently reproduce results effectively strengthens the underlying hypothesis of the work, and thus…

    Submitted 7 September, 2021; originally announced September 2021.

  35. arXiv:2107.11020 [pdf, other]

    cs.CL cs.CY

    Emotion analysis and detection during COVID-19

    Authors: Tiberiu Sosea, Chau Pham, Alexander Tekle, Cornelia Caragea, Junyi Jessy Li

    Abstract: Crises such as natural disasters, global pandemics, and social unrest continuously threaten our world and emotionally affect millions of people worldwide in distinct ways. Understanding emotions that people express during large-scale crises helps inform policy makers and first responders about the emotional states of the population as well as provide emotional support to those who need such suppor…

    Submitted 20 July, 2022; v1 submitted 23 July, 2021; originally announced July 2021.

    Comments: LREC 2022

  36. arXiv:2106.06038 [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Modeling Hierarchical Structures with Continuous Recursive Neural Networks

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea

    Abstract: Recursive Neural Networks (RvNNs), which compose sequences according to their underlying hierarchical syntactic structure, have performed well in several natural language processing tasks compared to similar models without structural biases. However, traditional RvNNs are incapable of inducing the latent structure in a plain text sequence on their own. Several extensions have been proposed to over…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: Accepted in ICML 2021 (long talk)

  37. arXiv:2009.00611 [pdf, other]

    cs.IR cs.CL cs.DL cs.LG

    Identifying Documents In-Scope of a Collection from Web Archives

    Authors: Krutarth Patel, Cornelia Caragea, Mark Phillips, Nathaniel Fox

    Abstract: Web archive data usually contains high-quality documents that are very useful for creating specialized collections of documents, e.g., scientific digital libraries and repositories of technical reports. In doing so, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection out of the huge number of documents collected by web archiving inst…

    Submitted 2 September, 2020; originally announced September 2020.

    Comments: 10 pages

    Journal ref: In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL 2020)

  38. arXiv:2008.02434 [pdf, other]

    cs.AI cs.IR

    Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering

    Authors: Ye Liu, Shaika Chowdhury, Chenwei Zhang, Cornelia Caragea, Philip S. Yu

    Abstract: Healthcare question answering assistance aims to provide customer healthcare information, which widely appears in both Web and mobile Internet. The questions usually require the assistance to have proficient healthcare background knowledge as well as the reasoning ability on the knowledge. Recently a challenge involving complex healthcare reasoning, HeadQA dataset, has been proposed, which contain…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: 10 pages, 6 figures

  39. arXiv:2004.14299 [pdf, other]

    cs.CL cs.CY

    Detecting Perceived Emotions in Hurricane Disasters

    Authors: Shrey Desai, Cornelia Caragea, Junyi Jessy Li

    Abstract: Natural disasters (e.g., hurricanes) affect millions of people each year, causing widespread destruction in their wake. People have recently taken to social media websites (e.g., Twitter) to share their sentiments and feelings with the larger community. Consequently, these platforms have become instrumental in understanding and perceiving emotions at scale. In this paper, we introduce HurricaneEmo…

    Submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted to ACL 2020; code available at https://github.com/shreydesai/hurricane

  40. arXiv:2001.01323 [pdf, other]

    cs.IR cs.CL cs.LG

    On Identifying Hashtags in Disaster Twitter Data

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea, Doina Caragea

    Abstract: Tweet hashtags have the potential to improve the search for information during disaster events. However, there is a large number of disaster-related tweets that do not have any user-provided hashtags. Moreover, only a small number of tweets that contain actionable hashtags are useful for disaster response. To facilitate progress on automatic identification (or extraction) of disaster hashtags for…

    Submitted 5 January, 2020; originally announced January 2020.

  41. arXiv:1910.07897 [pdf, other]

    cs.IR cs.CL cs.LG

    Keyphrase Extraction from Disaster-related Tweets

    Authors: Jishnu Ray Chowdhury, Cornelia Caragea, Doina Caragea

    Abstract: While keyphrase extraction has received considerable attention in recent years, relatively few studies exist on extracting keyphrases from social media platforms such as Twitter, and even fewer for extracting disaster-related keyphrases from such sources. During a disaster, keyphrases can be extremely useful for filtering relevant tweets that can enhance situational awareness. Previously, joint tr…

    Submitted 17 October, 2019; originally announced October 2019.

    Comments: 12 pages, 7 figures

    Journal ref: In The World Wide Web Conference (WWW '19), Ling Liu and Ryen White (Eds.). ACM, New York, NY, USA, 1555-1566 (2019)

  42. arXiv:1906.08470 [pdf, other]

    cs.DL cs.IR

    Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets

    Authors: Athar Sefid, Jian Wu, Allen C. Ge, Jing Zhao, Lu Liu, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles

    Abstract: Automatically extracted metadata from scholarly documents in PDF formats is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution which is based on information retrieval and string similarit…

    Submitted 20 June, 2019; originally announced June 2019.

  43. arXiv:1903.03695 [pdf, other]

    cs.CV cs.CY

    Image Privacy Prediction Using Deep Neural Networks

    Authors: Ashwini Tonge, Cornelia Caragea

    Abstract: Images today are increasingly shared online on social networking sites such as Facebook, Flickr, Foursquare, and Instagram. Despite that current social networking sites allow users to change their privacy preferences, this is often a cumbersome task for the vast majority of users on the Web, who face difficulties in assigning and managing privacy settings. Thus, automatically predicting images' pr…

    Submitted 8 March, 2019; originally announced March 2019.

  44. Dynamic Deep Multi-modal Fusion for Image Privacy Prediction

    Authors: Ashwini Tonge, Cornelia Caragea

    Abstract: With millions of images that are shared online on social networking sites, effective methods for image privacy prediction are highly needed. In this paper, we propose an approach for fusing object, scene context, and image tags modalities derived from convolutional neural networks for accurately predicting the privacy of images shared online. Specifically, our approach identifies the set of most c…

    Submitted 6 March, 2019; v1 submitted 27 February, 2019; originally announced February 2019.

    Comments: Accepted by The Web Conference (WWW) 2019

  45. arXiv:1604.05005 [pdf, ps, other]

    cs.IR cs.DL

    A Search/Crawl Framework for Automatically Acquiring Scientific Documents

    Authors: Sujatha Das Gollapalli, Krutarth Patel, Cornelia Caragea

    Abstract: Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for scientific portals. Within our framework, publicly-avail…

    Submitted 18 April, 2016; originally announced April 2016.

    Comments: 8 pages with references, 2 figures

    ACM Class: H.3.7

  46. arXiv:1510.08583 [pdf, other]

    cs.CV cs.CY

    Privacy Prediction of Images Shared on Social Media Sites Using Deep Features

    Authors: Ashwini Tonge, Cornelia Caragea

    Abstract: Online image sharing in social media sites such as Facebook, Flickr, and Instagram can lead to unwanted disclosure and privacy violations, when privacy settings are used inappropriately. With the exponential increase in the number of images that are shared online every day, the development of effective and efficient prediction methods for image privacy settings are highly needed. The performance o…

    Submitted 5 November, 2015; v1 submitted 29 October, 2015; originally announced October 2015.

  47. arXiv:1506.03775 [pdf, other]

    cs.CL cs.IR cs.SI

    Entity-Specific Sentiment Classification of Yahoo News Comments

    Authors: Prakhar Biyani, Cornelia Caragea, Narayan Bhamidipati

    Abstract: Sentiment classification is widely used for product reviews and in online social media such as forums, Twitter, and blogs. However, the problem of classifying the sentiment of user comments on news sites has not been addressed yet. News sites cover a wide range of domains including politics, sports, technology, and entertainment, in contrast to other online social sites such as forums and review s…

    Submitted 11 June, 2015; originally announced June 2015.

  48. arXiv:1401.6571 [pdf, other]

    cs.CL cs.IR

    Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks

    Authors: Shibamouli Lahiri, Sagnik Ray Choudhury, Cornelia Caragea

    Abstract: Keyword and keyphrase extraction is an important problem in natural language processing, with applications ranging from summarization to semantic search to document clustering. Graph-based approaches to keyword and keyphrase extraction avoid the problem of acquiring a large in-domain training corpus by applying variants of PageRank algorithm on a network of words. Although graph-based approaches a…

    Submitted 25 January, 2014; originally announced January 2014.

    Comments: 11 pages