Showing 1–5 of 5 results for author: Wendt, J B

  1. arXiv:2406.05079  [pdf, other]

    cs.CL cs.LG

    SUMIE: A Synthetic Benchmark for Incremental Entity Summarization

    Authors: Eunjeong Hwang, Yichao Zhou, Beliz Gunel, James Bradley Wendt, Sandeep Tata

    Abstract: No existing dataset adequately tests how well language models can incrementally update entity summaries - a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset effectively high…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 24 figures, 4 tables

  2. arXiv:2403.19710  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    STRUM-LLM: Attributed and Structured Contrastive Summarization

    Authors: Beliz Gunel, James B. Wendt, Jing Xie, Yichao Zhou, Nguyen Vo, Zachary Fisher, Sandeep Tata

    Abstract: Users often struggle with decision-making between two options (A vs B), as it usually requires time-consuming research across multiple web pages. We propose STRUM-LLM that addresses this challenge by generating attributed, structured, and helpful contrastive summaries that highlight key differences between the two options. STRUM-LLM identifies helpful contrast: the specific attributes along which…

    Submitted 25 March, 2024; originally announced March 2024.

  3. arXiv:2212.10047  [pdf, other]

    cs.CL

    An Augmentation Strategy for Visually Rich Documents

    Authors: Jing Xie, James B. Wendt, Yichao Zhou, Seth Ebner, Sandeep Tata

    Abstract: Many business workflows require extracting important fields from form-like documents (e.g. bank statements, bills of lading, purchase orders, etc.). Recent techniques for automating this task work well only when trained with large datasets. In this work we propose a novel data augmentation technique to improve performance when training data is scarce, e.g. 10-250 documents. Our technique, which we…

    Submitted 22 December, 2022; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: 9 pages, 6 figures, 3 tables

  4. arXiv:2210.16391  [pdf, other]

    cs.CL

    Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

    Authors: Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, Sandeep Tata

    Abstract: A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. We propose Selective Labeling to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially labeled do…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: 9 pages, 8 figures, 3 tables

  5. arXiv:2201.02647  [pdf, other]

    cs.LG cs.IR

    Data-Efficient Information Extraction from Form-Like Documents

    Authors: Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

    Abstract: Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should genera…

    Submitted 7 January, 2022; originally announced January 2022.

    Comments: Published at the 2nd Document Intelligence Workshop @ KDD 2021 (https://document-intelligence.github.io/DI-2021/)