
Showing 1–19 of 19 results for author: Staar, P

  1. arXiv:2406.19102  [pdf, other]

    cs.CL cs.AI cs.IR

    Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

    Authors: Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar

    Abstract: Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

  2. arXiv:2405.10725  [pdf, other]

    cs.CL cs.IR

    INDUS: Effective and Efficient Language Models for Scientific Applications

    Authors: Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Mike Little, Elizabeth Fancher, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grezes, Megan Ansdell, Alberto Accomazzi, Yousef El-Kurdi, Davis Wertheimer, Birgit Pfitzmann, Cesar Berrospi Ramis, et al. (9 additional authors not shown)

    Abstract: Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics,…

    Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  3. arXiv:2405.00505  [pdf, other]

    cs.IR cs.LG

    KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

    Authors: Oshri Naparstek, Roi Pony, Inbar Shapira, Foad Abo Dahood, Ophir Azulai, Yevgeny Yaroker, Nadav Rubinstein, Maksym Lysak, Peter Staar, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Elad Amrani, Idan Friedman, Orit Prince, Yevgeny Burshtein, Adi Raz Goldfarb, Udi Barzelay

    Abstract: In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academy, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: accepted ICDAR2024

  4. ESG Accountability Made Easy: DocQA at Your Service

    Authors: Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar

    Abstract: We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Accepted at the Demonstration Track of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI 24)

    Journal ref: AAAI 2024, 38, 23814-23816

  5. arXiv:2308.12234  [pdf, other]

    cs.CV

    MolGrapher: Graph-based Visual Recognition of Chemical Structures

    Authors: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valery Weber, Ingmar Meijer, Peter Staar, Fisher Yu

    Abstract: The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diver…

    Submitted 23 August, 2023; originally announced August 2023.

  6. ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

    Authors: Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, Peter Staar

    Abstract: Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions t…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: ICDAR 2023, 10 pages, 4 figures

  7. arXiv:2305.03393  [pdf, other]

    cs.CV

    Optimized Table Tokenization for Table Structure Recognition

    Authors: Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Peter Staar

    Abstract: Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Si…

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to ICDAR 2023, 12 pages, 6 figures

  8. arXiv:2210.13118  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Term Extraction for Highly Technical Domains

    Authors: Francesco Fusco, Peter Staar, Diego Antognini

    Abstract: Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowled…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted at EMNLP 2022 (industry). 8 pages, 3 figures, 3 tables

  9. arXiv:2209.03648  [pdf, other]

    cs.CV

    FETA: Towards Specializing Foundation Models for Expert Task Applications

    Authors: Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

    Abstract: Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail…

    Submitted 19 December, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

  10. arXiv:2207.01220  [pdf, other]

    cs.CV cs.AI

    BusiNet -- a Light and Fast Text Detection Network for Business Documents

    Authors: Oshri Naparstek, Ophir Azulai, Daniel Rotman, Yevgeny Burshtein, Peter Staar, Udi Barzelay

    Abstract: For digitizing or indexing physical documents, Optical Character Recognition (OCR), the process of extracting textual information from scanned documents, is a vital technology. When a document is visually damaged or contains non-textual elements, existing technologies can yield poor results, as erroneous detection results can greatly affect the quality of OCR. In this paper we present a detection…

    Submitted 4 July, 2022; originally announced July 2022.

  11. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

    Authors: Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, Peter W J Staar

    Abstract: Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since t…

    Submitted 2 June, 2022; originally announced June 2022.

    Comments: 9 pages, 6 figures, 5 tables. Accepted paper at SIGKDD 2022 conference

  12. Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

    Authors: Christoph Auer, Michele Dolfi, André Carvalho, Cesar Berrospi Ramis, Peter W. J. Staar

    Abstract: Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as…

    Submitted 1 June, 2022; originally announced June 2022.

    Comments: 11 pages, 7 figures, to be published in IEEE CLOUD 2022

    ACM Class: I.7.5; I.2.1; C.1.4; C.4

  13. arXiv:2203.01017  [pdf, other]

    cs.CV cs.LG

    TableFormer: Table Structure Understanding with Transformers

    Authors: Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar

    Abstract: Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separati…

    Submitted 11 March, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

  14. arXiv:2202.04350  [pdf, other]

    cs.CL cs.AI cs.LG

    pNLP-Mixer: an Efficient all-MLP Architecture for Language

    Authors: Francesco Fusco, Damian Pascual, Peter Staar, Diego Antognini

    Abstract: Large pre-trained language models based on transformer architecture have drastically changed the natural language processing (NLP) landscape. However, deploying those models for on-device applications in constrained devices such as smart watches is completely impractical due to their size and inference cost. As an alternative to transformer-based architectures, recent work on efficient NLP has sho…

    Submitted 25 May, 2023; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: Accepted at ACL 2023 (industry). 8 pages, 2 figures, 4 tables

  15. arXiv:2112.02300  [pdf, other]

    cs.CV

    Unsupervised Domain Generalization by Learning a Bridge Across Domains

    Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

    Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generaliz…

    Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

  16. arXiv:2102.09395  [pdf, other]

    cs.LG cs.CV cs.IR

    Robust PDF Document Conversion Using Recurrent Neural Networks

    Authors: Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

    Abstract: The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretatio…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: 9 pages, 2 tables, 4 figures, uses aaai21.sty. Accepted at the "Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21)". Received the "IAAI-21 Innovative Application Award"

    ACM Class: I.7.5; I.5.1; I.5.2; I.5.4; I.5.5; I.2.1

  17. arXiv:1907.08400  [pdf, other]

    cs.IR cs.LG

    An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries

    Authors: Matteo Manica, Christoph Auer, Valery Weber, Federico Zipoli, Michele Dolfi, Peter Staar, Teodoro Laino, Costas Bekas, Akihiro Fujita, Hiroki Toda, Shuichi Hirose, Yasumitsu Orii

    Abstract: Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document…

    Submitted 19 July, 2019; originally announced July 2019.

    Comments: 4 pages, 1 figure, Workshop on Applied Data Science for Healthcare at KDD, Anchorage, AK, 2019

  18. arXiv:1806.02284  [pdf, other]

    cs.DL cs.CV cs.DC

    Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

    Authors: Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas

    Abstract: Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. comple…

    Submitted 24 May, 2018; originally announced June 2018.

    Comments: Accepted paper at KDD 2018 conference

  19. arXiv:1805.09687  [pdf, other]

    cs.DL cs.CL cs.CV cs.DC cs.IR

    Corpus Conversion Service: A machine learning platform to ingest documents at scale [Poster abstract]

    Authors: Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas

    Abstract: Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make their content discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. complex tables)…

    Submitted 15 May, 2018; originally announced May 2018.

    Comments: Accepted in SysML 2018 (www.sysml.cc)