Showing 1–1 of 1 results for author: Siebenschuh, C

  1. arXiv:2312.10188  [pdf, other]

    cs.LG

    WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data

    Authors: Maurice Weber, Carlo Siebenschuh, Rory Butler, Anton Alexandrov, Valdemar Thanner, Georgios Tsolakis, Haris Jabbar, Ian Foster, Bo Li, Rick Stevens, Ce Zhang

    Abstract: We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches proved effective for visual question answering or layout segmentation. However,…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: NeurIPS 2023 Datasets and Benchmarks