Showing 1–12 of 12 results for author: Shiralkar, P

  1. arXiv:2309.05619  [pdf, other]

    cs.CL

    Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP

    Authors: Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel J Perry, Prashant Shiralkar, Zhengzheng Xing, Aaron Colak

    Abstract: Large language models (LLMs) have demonstrated significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to assess the performance of the LLM on unlabeled production data from time to time to validate for a real-world setting. Human labeling to assess model error requires considerable expense and time delay. Here we demonstrate that ensemb…

    Submitted 19 November, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Camera ready version for 2023 EMNLP (The Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM))

  2. arXiv:2305.14549  [pdf, other]

    cs.IR

    Extracting Shopping Interest-Related Product Types from the Web

    Authors: Yinghao Li, Colin Lockard, Prashant Shiralkar, Chao Zhang

    Abstract: Recommending a diversity of product types (PTs) is important for a good shopping experience when customers are looking for products around their high-level shopping interests (SIs) such as hiking. However, the SI-PT connection is typically absent in e-commerce product catalogs and expensive to construct manually due to the volume of potential SIs, which prevents us from establishing a recommender…

    Submitted 23 May, 2023; originally announced May 2023.

  3. arXiv:2208.13086  [pdf]

    cs.IR cs.LG

    Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents

    Authors: Ritesh Sarkhel, Binxuan Huang, Colin Lockard, Prashant Shiralkar

    Abstract: Extracting structured information from HTML documents is a long-studied problem with a broad range of applications, including knowledge base construction, faceted search, and personalized recommendation. Prior works rely on a few human-labeled web pages from each target website or thousands of human-labeled web pages from some seed websites to train a transferable extraction model that generalizes…

    Submitted 27 August, 2022; originally announced August 2022.

  4. arXiv:2201.10608  [pdf, other]

    cs.CL

    DOM-LM: Learning Generalizable Representations for HTML Documents

    Authors: Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun

    Abstract: HTML documents are an important medium for disseminating information on the Web for human consumption. An HTML document presents information in multiple text formats including unstructured text, structured key-value pairs, and tables. Effective representation of these documents is essential for machine understanding to enable a wide range of applications, such as Question Answering, Web Search, an…

    Submitted 25 January, 2022; originally announced January 2022.

  5. TCN: Table Convolutional Network for Web Table Interpretation

    Authors: Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, Meng Jiang

    Abstract: Information extraction from semi-structured webpages provides valuable long-tailed facts for augmenting knowledge graph. Relational Web tables are a critical component containing additional entities and attributes of rich and diverse knowledge. However, extracting knowledge from relational tables is challenging because of sparse contextual information. Existing work linearize table cells and heavi…

    Submitted 16 February, 2021; originally announced February 2021.

  6. arXiv:2005.07105  [pdf, other]

    cs.CL cs.IR

    ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages

    Authors: Colin Lockard, Prashant Shiralkar, Xin Luna Dong, Hannaneh Hajishirzi

    Abstract: In many documents, such as semi-structured webpages, textual semantics are augmented with additional information conveyed using visual elements including layout, font size, and color. Prior work on information extraction from semi-structured websites has required learning an extraction model specific to a given template via either manually labeled or distantly supervised data from that template. I…

    Submitted 14 May, 2020; originally announced May 2020.

    Comments: Accepted to ACL 2020

  7. arXiv:2004.07493  [pdf, other]

    cs.CL cs.IR cs.LG

    TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition

    Authors: Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar, Xiang Ren

    Abstract: Training neural models for named entity recognition (NER) in a new domain often requires additional human annotations (e.g., tens of thousands of labeled instances) that are usually expensive and time-consuming to collect. Thus, a crucial research question is how to obtain supervision in a cost-effective way. In this paper, we introduce "entity triggers," an effective proxy of human explanations f…

    Submitted 6 July, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

    Comments: Accepted to ACL 2020. Project page: https://inklab.usc.edu/TriggerNER/ (Fixed a few typos and added a new figure.)

    Journal ref: Proc. of ACL 2020, page 8503--8511

  8. arXiv:1804.04635  [pdf, other]

    cs.AI cs.IR

    CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

    Authors: Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar

    Abstract: The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated lab…

    Submitted 12 April, 2018; originally announced April 2018.

    Comments: Expanded version of paper under review for VLDB

  9. arXiv:1712.08674  [pdf]

    cs.IR

    RelSifter: Scoring Triples from Type-like Relations - The Samphire Triple Scorer at WSDM Cup 2017

    Authors: Prashant Shiralkar, Mihai Avram, Giovanni Luca Ciampaglia, Filippo Menczer, Alessandro Flammini

    Abstract: We present RelSifter, a supervised learning approach to the problem of assigning relevance scores to triples expressing type-like relations such as 'profession' and 'nationality.' To provide additional contextual information about individuals and relations we supplement the data provided as part of the WSDM 2017 Triple Score contest with Wikidata and DBpedia, two large-scale knowledge graphs (KG).…

    Submitted 22 December, 2017; originally announced December 2017.

    Comments: Triple Scorer at WSDM Cup 2017, see arXiv:1712.08081

    ACM Class: H.3

  10. arXiv:1708.07239  [pdf, other]

    cs.AI cs.SI

    Finding Streams in Knowledge Graphs to Support Fact Checking

    Authors: Prashant Shiralkar, Alessandro Flammini, Filippo Menczer, Giovanni Luca Ciampaglia

    Abstract: The volume and velocity of information that gets generated online limits current journalistic practices to fact-check claims at the same rate. Computational approaches for fact checking may be the key to help mitigate the risks of massive misinformation spread. Such approaches can be designed to not only be scalable and effective at assessing veracity of dubious claims, but also to boost a human f…

    Submitted 23 August, 2017; originally announced August 2017.

    Comments: Extended version of the paper in proceedings of ICDM 2017

  11. arXiv:1601.05140  [pdf]

    cs.SI cs.AI cs.CY physics.data-an physics.soc-ph

    The DARPA Twitter Bot Challenge

    Authors: V. S. Subrahmanian, Amos Azaria, Skylar Durst, Vadim Kagan, Aram Galstyan, Kristina Lerman, Linhong Zhu, Emilio Ferrara, Alessandro Flammini, Filippo Menczer, Andrew Stevens, Alexander Dekhtyar, Shuyang Gao, Tad Hogg, Farshad Kooti, Yan Liu, Onur Varol, Prashant Shiralkar, Vinod Vydiswaran, Qiaozhu Mei, Tim Hwang

    Abstract: A number of organizations ranging from terrorist groups such as ISIS to politicians and nation states reportedly conduct explicit campaigns to influence opinion on social media, posing a risk to democratic processes. There is thus a growing need to identify and eliminate "influence bots" - realistic, automated identities that illicitly shape discussion on sites like Twitter and Facebook - before t…

    Submitted 21 April, 2016; v1 submitted 19 January, 2016; originally announced January 2016.

    Comments: IEEE Computer Magazine, in press

    Journal ref: Computer 49 (6), 38-46. IEEE, 2016

  12. arXiv:1501.03471  [pdf, other]

    cs.CY cs.SI physics.soc-ph

    Computational fact checking from knowledge networks

    Authors: Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M. Rocha, Johan Bollen, Filippo Menczer, Alessandro Flammini

    Abstract: Traditional fact checking by expert journalists cannot keep up with the enormous volume of information that is now generated online. Computational fact checking may significantly enhance our ability to evaluate the veracity of dubious information. Here we show that the complexities of human fact checking can be approximated quite well by finding the shortest path between concept nodes under proper…

    Submitted 14 January, 2015; originally announced January 2015.