
Showing 1–9 of 9 results for author: Šulc, M

  1. arXiv:2302.05658

    cs.CL cs.AI cs.LG

    DocILE Benchmark for Document Information Localization and Extraction

    Authors: Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas

    Abstract: This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific…

    Submitted 3 May, 2023; v1 submitted 11 February, 2023; originally announced February 2023.

    Comments: Accepted to ICDAR 2023

  2. arXiv:2301.12394

    cs.LG cs.AI

    DocILE 2023 Teaser: Document Information Localization and Extraction

    Authors: Štěpán Šimsa, Milan Šulc, Matyáš Skalický, Yash Patel, Ahmed Hamdi

    Abstract: The lack of data for information extraction (IE) from semi-structured business documents is a real problem for the IE community. Publications relying on large-scale datasets use only proprietary, unpublished data due to the sensitive nature of such documents. Publicly available datasets are mostly small and domain-specific. The absence of a large-scale public dataset or benchmark hinders the repro…

    Submitted 29 January, 2023; originally announced January 2023.

    Comments: Accepted to ECIR 2023

  3. arXiv:2211.14451

    cs.CV cs.CL cs.LG

    GLAMI-1M: A Multilingual Image-Text Fashion Dataset

    Authors: Vaclav Kosar, Antonín Hoskovec, Milan Šulc, Radek Bartyzal

    Abstract: We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification…

    Submitted 17 November, 2022; originally announced November 2022.

  4. arXiv:2211.03646

    cs.CV cs.LG

    Contrastive Classification and Representation Learning with Probabilistic Interpretation

    Authors: Rahaf Aljundi, Yash Patel, Milan Sulc, Daniel Olmeda, Nikolay Chumerin

    Abstract: Cross entropy loss has served as the main objective function for classification-based tasks. Widely deployed for learning neural network classifiers, it shows both effectiveness and a probabilistic interpretation. Recently, after the success of self supervised contrastive representation learning methods, supervised contrastive methods have been proposed to learn representations and have shown supe…

    Submitted 7 November, 2022; originally announced November 2022.

  5. arXiv:2210.07903

    cs.CV

    Text Detection Forgot About Document OCR

    Authors: Krzysztof Olejniczak, Milan Šulc

    Abstract: Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not achieve 100% accuracy, requiring human corrections in applications where correct readout is essential. Advances in machine learning enabled even more challengin…

    Submitted 23 January, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted to the 26th Computer Vision Winter Workshop (CVWW), 2023

  6. arXiv:2206.11229

    cs.IR cs.AI cs.CV cs.LG

    Business Document Information Extraction: Towards Practical Benchmarks

    Authors: Matyáš Skalický, Štěpán Šimsa, Michal Uřičář, Milan Šulc

    Abstract: Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the la…

    Submitted 20 June, 2022; originally announced June 2022.

    Comments: Accepted to CLEF 2022

  7. arXiv:2106.11695

    cs.CV cs.LG

    The Hitchhiker's Guide to Prior-Shift Adaptation

    Authors: Tomas Sipka, Milan Sulc, Jiri Matas

    Abstract: In many computer vision classification tasks, class priors at test time often differ from priors on the training set. In the case of such prior shift, classifiers must be adapted correspondingly to maintain close to optimal performance. This paper analyzes methods for adaptation of probabilistic classifiers to new priors and for estimating new priors on an unlabeled test set. We propose a novel me…

    Submitted 3 December, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: WACV 2022; 16 pages, 7 figures

  8. Danish Fungi 2020 -- Not Just Another Image Recognition Dataset

    Authors: Lukáš Picek, Milan Šulc, Jiří Matas, Jacob Heilmann-Clausen, Thomas S. Jeppesen, Thomas Læssøe, Tobias Frøslev

    Abstract: We introduce a novel fine-grained dataset and benchmark, the Danish Fungi 2020 (DF20). The dataset, constructed from observations submitted to the Atlas of Danish Fungi, is unique in its taxonomy-accurate class labels, small number of errors, highly unbalanced long-tailed class distribution, rich observation metadata, and well-defined class hierarchy. DF20 has zero overlap with ImageNet, allowing…

    Submitted 20 August, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

  9. arXiv:1805.08235

    cs.CV

    Improving CNN classifiers by estimating test-time priors

    Authors: Milan Sulc, Jiri Matas

    Abstract: The problem of different training and test set class priors is addressed in the context of CNN classifiers. We compare two different approaches to estimating the new priors: an existing Maximum Likelihood Estimation approach (optimized by an EM algorithm or by projected gradient descent) and a proposed Maximum a Posteriori approach, which increases the stability of the estimate by introducing a Di…

    Submitted 9 April, 2019; v1 submitted 21 May, 2018; originally announced May 2018.