Semi-supervised Human Pose Estimation in Art-historical Images

Authors: Matthias Springstein, Stefanie Schneider, Christian Althaus, Ralph Ewerth

Abstract: Gesture as language of non-verbal communication has been theoretically established since the 17th century. However, its relevance for the visual arts has been expressed only sporadically. This may be primarily due to the sheer overwhelming amount of data that traditionally had to be processed by hand. With the steady progress of digitization, though, a growing number of historical artifacts have been indexed and made available to the public, creating a need for automatic retrieval of art-historical motifs with similar body constellations or poses. Since the domain of art differs significantly from existing real-world data sets for human pose estimation due to its style variance, this presents new challenges. In this paper, we propose a novel approach to estimate human poses in art-historical images. In contrast to previous work that attempts to bridge the domain gap with pre-trained models or through style transfer, we suggest semi-supervised learning for both object and keypoint detection. Furthermore, we introduce a novel domain-specific art data set that includes both bounding box and keypoint annotations of human figures. Our approach achieves significantly better results than methods that use pre-trained models or style transfer.

Submitted 15 August, 2022; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: Accepted at ACM MM 2022 as a conference paper

arXiv:2108.01542

doi 10.1145/3474085.3478564

iART: A Search Engine for Art-Historical Images to Support Research in the Humanities

Authors: Matthias Springstein, Stefanie Schneider, Javad Rahnama, Eyke Hüllermeier, Hubertus Kohle, Ralph Ewerth

Abstract: In this paper, we introduce iART: an open Web platform for art-historical research that facilitates the process of comparative vision. The system integrates various machine learning techniques for keyword- and content-based image retrieval as well as category formation via clustering. An intuitive GUI helps users define queries and explore results. By using a state-of-the-art cross-modal deep learning approach, it is possible to search for concepts that were not previously detected by trained classification models. Art-historical objects from large, openly licensed collections such as Amsterdam Rijksmuseum and Wikidata are made available to users.

Submitted 3 August, 2021; originally announced August 2021.

Journal ref: ACM Multimedia Conference 2021

arXiv:2106.09432

Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention

Authors: Matthias Springstein, Eric Müller-Budack, Ralph Ewerth

Abstract: The recognition of handwritten mathematical expressions in images and video frames is a difficult and still unsolved problem. Deep convolutional neural networks are in principle a promising approach, but typically require a large amount of labeled training data. However, such a large training dataset does not exist for the task of handwritten formula recognition. In this paper, we introduce a system that creates a large set of synthesized training examples of mathematical expressions which are derived from LaTeX documents. For this purpose, we propose a novel attention-based generative adversarial network to translate rendered equations to handwritten formulas. The datasets generated by this approach contain hundreds of thousands of formulas, making them ideal for pretraining or the design of more complex models. We evaluate our synthesized dataset and the recognition approach on the CROHME 2014 benchmark dataset. Experimental results demonstrate the feasibility of the approach.

Submitted 17 June, 2021; originally announced June 2021.

Comments: Accepted for publication in: ACM International Conference on Multimedia Retrieval (ICMR) Workshop 2021

arXiv:2104.13748

QuTI! Quantifying Text-Image Consistency in Multimodal Documents

Authors: Matthias Springstein, Eric Müller-Budack, Ralph Ewerth

Abstract: The World Wide Web and social media platforms have become popular sources for news and information. Typically, multimodal information, e.g., image and text, is used to convey information more effectively and to attract attention. While in most cases image content is decorative or depicts additional information, it has also been leveraged to spread misinformation and rumors in recent years. In this paper, we present a Web-based demo application that automatically quantifies the cross-modal relations of entities (persons, locations, and events) in image and text. The applications are manifold. For example, the system can help users to explore multimodal articles more efficiently, or can assist human assessors and fact-checking efforts in the verification of the credibility of news stories, tweets, or other multimodal documents.

Submitted 28 April, 2021; originally announced April 2021.

Comments: Accepted for publication in: International ACM SIGIR Conference on Research and Development in Information Retrieval 2021

arXiv:2011.04714

Ontology-driven Event Type Classification in Images

Authors: Eric Müller-Budack, Matthias Springstein, Sherzod Hakimov, Kevin Mrutzek, Ralph Ewerth

Abstract: Event classification can add valuable information for semantic search and the increasingly important topic of fact validation in news. So far, only a few approaches address image classification for newsworthy event types such as natural disasters, sports events, or elections. Previous work distinguishes only between a limited number of event types and relies on rather small datasets for training. In this paper, we present a novel ontology-driven approach for the classification of event types in images. We leverage a large number of real-world news events to pursue two objectives: First, we create an ontology based on Wikidata comprising the majority of event types. Second, we introduce a novel large-scale dataset that was acquired through Web crawling. Several baselines are proposed including an ontology-driven learning approach that aims to exploit structured information of a knowledge graph to learn relevant event relations using deep neural networks. Experimental results on existing as well as novel benchmark datasets demonstrate the superiority of the proposed ontology-driven approach.

Submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted for publication in: IEEE Winter Conference on Applications of Computer Vision (WACV) 2021

arXiv:1906.08595

doi 10.1145/3323873.3325049

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Authors: Christian Otto, Matthias Springstein, Avishek Anand, Ralph Ewerth

Abstract: Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of data sets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.

Submitted 20 June, 2019; originally announced June 2019.

Comments: 8 pages, 8 Figures, 5 tables

Journal ref: In Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR '19). ACM, New York, NY, USA, 168-176

arXiv:1806.06796

TIB-arXiv: An Alternative Search Portal for the arXiv Pre-print Server

Authors: Matthias Springstein, Huu Hung Nguyen, Anett Hoppe, Ralph Ewerth

Abstract: arXiv is a popular pre-print server focusing on natural science disciplines (e.g. physics, computer science, quantitative biology). As a platform with a focus on easy publishing services, it does not provide enhanced search functionality -- but offers programming interfaces which allow external parties to add these services. This paper presents extensions of the open source framework arXiv Sanity Preserver (SP). With respect to the original framework, it derestricts the topical focus and allows for text-based search and visualisation of all papers in arXiv. To this end, all papers are stored in a unified back-end; the extension provides enhanced search and ranking facilities and allows the exploration of arXiv papers by a novel user interface.

Submitted 18 June, 2018; originally announced June 2018.

Showing 1–7 of 7 results for author: Springstein, M