Search | arXiv e-print repository

KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

Authors: Oshri Naparstek, Roi Pony, Inbar Shapira, Foad Abo Dahood, Ophir Azulai, Yevgeny Yaroker, Nadav Rubinstein, Maksym Lysak, Peter Staar, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Elad Amrani, Idan Friedman, Orit Prince, Yevgeny Burshtein, Adi Raz Goldfarb, Udi Barzelay

Abstract: In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academy, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where… ▽ More In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academy, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where the extraction process revolves around extracting information using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k , a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: accepted ICDAR2024

arXiv:2311.18481 [pdf, other]

doi 10.1609/aaai.v38i21.30574

ESG Accountability Made Easy: DocQA at Your Service

Authors: Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar

Abstract: We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via… ▽ More We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io. △ Less

Submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted at the Demonstration Track of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI 24)

Journal ref: AAAI 2024, 38, 23814-23816

arXiv:2308.12234 [pdf, other]

MolGrapher: Graph-based Visual Recognition of Chemical Structures

Authors: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valery Weber, Ingmar Meijer, Peter Staar, Fisher Yu

Abstract: The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diver… ▽ More The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available. △ Less

Submitted 23 August, 2023; originally announced August 2023.

arXiv:2305.14962 [pdf, other]

doi 10.1007/978-3-031-41679-8_27

ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

Authors: Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, Peter Staar

Abstract: Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions t… ▽ More Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions to document layout understanding. In this report, we present the results of our \textit{ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents}, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. To raise the bar over previous competitions, we engineered a hard competition dataset and proposed the recent DocLayNet dataset for training. We recorded 45 team registrations and received official submissions from 21 teams. In the presented solutions, we recognize interesting combinations of recent computer vision models, data augmentation strategies and ensemble methods to achieve remarkable accuracy in the task we posed. A clear trend towards adoption of vision-transformer based methods is evident. The results demonstrate substantial progress towards achieving robust and highly generalizing methods for document layout understanding. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: ICDAR 2023, 10 pages, 4 figures

arXiv:2305.04927 [pdf, other]

Detecting and Reasoning of Deleted Tweets before they are Posted

Authors: Hamdy Mubarak, Samir Abdaljalil, Azza Nassar, Firoj Alam

Abstract: Social media platforms empower us in several ways, from information dissemination to consumption. While these platforms are useful in promoting citizen journalism, public awareness etc., they have misuse potentials. Malicious users use them to disseminate hate-speech, offensive content, rumor etc. to gain social and political agendas or to harm individuals, entities and organizations. Often times,… ▽ More Social media platforms empower us in several ways, from information dissemination to consumption. While these platforms are useful in promoting citizen journalism, public awareness etc., they have misuse potentials. Malicious users use them to disseminate hate-speech, offensive content, rumor etc. to gain social and political agendas or to harm individuals, entities and organizations. Often times, general users unconsciously share information without verifying it, or unintentionally post harmful messages. Some of such content often get deleted either by the platform due to the violation of terms and policies, or users themselves for different reasons, e.g., regrets. There is a wide range of studies in characterizing, understanding and predicting deleted content. However, studies which aims to identify the fine-grained reasons (e.g., posts are offensive, hate speech or no identifiable reason) behind deleted content, are limited. In this study we address this gap, by identifying deleted tweets, particularly within the Arabic context, and labeling them with a corresponding fine-grained disinformation category. We then develop models that can predict the potentiality of tweets getting deleted, as well as the potential reasons behind deletion. Such models can help in moderating social media posts before even posting. △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: disinformation, misinformation, fake news

MSC Class: 68T50 ACM Class: F.2.2; I.2.7

arXiv:2305.03393 [pdf, other]

Optimized Table Tokenization for Table Structure Recognition

Authors: Maksym Lysak, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Peter Staar

Abstract: Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Si… ▽ More Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs. △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: Accepted to ICDAR 2023, 12 pages, 6 figures

arXiv:2206.01062 [pdf, other]

doi 10.1145/3534678.3539043

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Authors: Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, Peter W J Staar

Abstract: Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since t… ▽ More Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present \textit{DocLayNet}, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10\% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis. △ Less

Submitted 2 June, 2022; originally announced June 2022.

Comments: 9 pages, 6 figures, 5 tables. Accepted paper at SIGKDD 2022 conference

arXiv:2205.12328 [pdf]

Multilevel sentiment analysis in arabic

Authors: Ahmed Nassar, Ebru Sezer

Abstract: In this study, we aimed to improve the performance results of Arabic sentiment analysis. This can be achieved by investigating the most successful machine learning method and the most useful feature vector to classify sentiments in both term and document levels into two (positive or negative) categories. Moreover, specification of one polarity degree for the term that has more than one is investig… ▽ More In this study, we aimed to improve the performance results of Arabic sentiment analysis. This can be achieved by investigating the most successful machine learning method and the most useful feature vector to classify sentiments in both term and document levels into two (positive or negative) categories. Moreover, specification of one polarity degree for the term that has more than one is investigated. Also to handle the negations and intensifications, some rules are developed. According to the obtained results, Artificial Neural Network classifier is nominated as the best classifier in both term and document level sentiment analysis (SA) for Arabic Language. Furthermore, the average F-score achieved in the term level SA for both positive and negative testing classes is 0.92. In the document level SA, the average F-score for positive testing classes is 0.94, while for negative classes is 0.93. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: 10 pages, 3 figures, Published in: 2019 IEEE 7th Palestinian International Conference on Electrical and Computer Engineering (PICECE), Date of Conference: 26-27 March 2019

Report number: INSPEC Accession Number: 18793641

arXiv:2203.01017 [pdf, other]

TableFormer: Table Structure Understanding with Transformers

Authors: Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar

Abstract: Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separati… ▽ More Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a non-trivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-to-end deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables. △ Less

Submitted 11 March, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2107.12940 [pdf, other]

Finding Failures in High-Fidelity Simulation using Adaptive Stress Testing and the Backward Algorithm

Authors: Mark Koren, Ahmed Nassar, Mykel J. Kochenderfer

Abstract: Validating the safety of autonomous systems generally requires the use of high-fidelity simulators that adequately capture the variability of real-world scenarios. However, it is generally not feasible to exhaustively search the space of simulation scenarios for failures. Adaptive stress testing (AST) is a method that uses reinforcement learning to find the most likely failure of a system. AST wit… ▽ More Validating the safety of autonomous systems generally requires the use of high-fidelity simulators that adequately capture the variability of real-world scenarios. However, it is generally not feasible to exhaustively search the space of simulation scenarios for failures. Adaptive stress testing (AST) is a method that uses reinforcement learning to find the most likely failure of a system. AST with a deep reinforcement learning solver has been shown to be effective in finding failures across a range of different systems. This approach generally involves running many simulations, which can be very expensive when using a high-fidelity simulator. To improve efficiency, we present a method that first finds failures in a low-fidelity simulator. It then uses the backward algorithm, which trains a deep neural network policy using a single expert demonstration, to adapt the low-fidelity failures to high-fidelity. We have created a series of autonomous vehicle validation case studies that represent some of the ways low-fidelity and high-fidelity simulators can differ, such as time discretization. We demonstrate in a variety of case studies that this new AST approach is able to find failures with significantly fewer high-fidelity simulation steps than are needed when just running AST directly in high-fidelity. As a proof of concept, we also demonstrate AST on NVIDIA's DriveSim simulator, an industry state-of-the-art high-fidelity simulator for finding failures in autonomous vehicles. △ Less

Submitted 27 July, 2021; originally announced July 2021.

Comments: Accepted to IROS 2021

arXiv:2105.01512 [pdf, other]

doi 10.46298/lmcs-19(4:19)2023

Simulation by Rounds of Letter-to-Letter Transducers

Authors: Antonio Abu Nassar, Shaull Almagor

Abstract: Letter-to-letter transducers are a standard formalism for modeling reactive systems. Often, two transducers that model similar systems differ locally from one another, by behaving similarly, up to permutations of the input and output letters within "rounds". In this work, we introduce and study notions of simulation by rounds and equivalence by rounds of transducers. In our setting, words are part… ▽ More Letter-to-letter transducers are a standard formalism for modeling reactive systems. Often, two transducers that model similar systems differ locally from one another, by behaving similarly, up to permutations of the input and output letters within "rounds". In this work, we introduce and study notions of simulation by rounds and equivalence by rounds of transducers. In our setting, words are partitioned to consecutive subwords of a fixed length $k$, called rounds. Then, a transducer $\mathcal{T}_1$ is $k$-round simulated by transducer $\mathcal{T}_2$ if, intuitively, for every input word $x$, we can permute the letters within each round in $x$, such that the output of $\mathcal{T}_2$ on the permuted word is itself a permutation of the output of $\mathcal{T}_1$ on $x$. Finally, two transducers are $k$-round equivalent if they simulate each other. We solve two main decision problems, namely whether $\mathcal{T}_2$ $k$-round simulates $\mathcal{T}_1$ (1) when $k$ is given as input, and (2) for an existentially quantified $k$. We demonstrate the usefulness of the definitions by applying them to process symmetry: a setting in which a permutation in the identities of processes in a multi-process system naturally gives rise to two transducers, whose $k$-round equivalence corresponds to stability against such permutations. △ Less

Submitted 4 December, 2023; v1 submitted 4 May, 2021; originally announced May 2021.

Journal ref: Logical Methods in Computer Science, Volume 19, Issue 4 (December 5, 2023) lmcs:9920

arXiv:2102.09395 [pdf, other]

Robust PDF Document Conversion Using Recurrent Neural Networks

Authors: Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

Abstract: The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretatio… ▽ More The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally compared to visual methods because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and it is computationally less expensive than visual methods. This allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19. △ Less

Submitted 18 February, 2021; originally announced February 2021.

Comments: 9 pages, 2 tables, 4 figures, uses aaai21.sty. Accepted at the "Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21)". Received the "IAAI-21 Innovative Application Award"

ACM Class: I.7.5; I.5.1; I.5.2; I.5.4; I.5.5; I.2.1

arXiv:2010.09916 [pdf, other]

Deep Reinforcement Learning for Adaptive Network Slicing in 5G for Intelligent Vehicular Systems and Smart Cities

Authors: Almuthanna Nassar, Yasin Yilmaz

Abstract: Intelligent vehicular systems and smart city applications are the fastest growing Internet of things (IoT) implementations at a compound annual growth rate of 30%. In view of the recent advances in IoT devices and the emerging new breed of IoT applications driven by artificial intelligence (AI), fog radio access network (F-RAN) has been recently introduced for the fifth generation (5G) wireless co… ▽ More Intelligent vehicular systems and smart city applications are the fastest growing Internet of things (IoT) implementations at a compound annual growth rate of 30%. In view of the recent advances in IoT devices and the emerging new breed of IoT applications driven by artificial intelligence (AI), fog radio access network (F-RAN) has been recently introduced for the fifth generation (5G) wireless communications to overcome the latency limitations of cloud-RAN (C-RAN). We consider the network slicing problem of allocating the limited resources at the network edge (fog nodes) to vehicular and smart city users with heterogeneous latency and computing demands in dynamic environments. We develop a network slicing model based on a cluster of fog nodes (FNs) coordinated with an edge controller (EC) to efficiently utilize the limited resources at the network edge. For each service request in a cluster, the EC decides which FN to execute the task, i.e., locally serve the request at the edge, or to reject the task and refer it to the cloud. We formulate the problem as infinite-horizon Markov decision process (MDP) and propose a deep reinforcement learning (DRL) solution to adaptively learn the optimal slicing policy. The performance of the proposed DRL-based slicing method is evaluated by comparing it with other slicing approaches in dynamic environments and for different scenarios of design objectives. Comprehensive simulation results corroborate that the proposed DRL-based EC quickly learns the optimal policy through interaction with the environment, which enables adaptive and automated network slicing for efficient resource allocation in dynamic vehicular and smart city environments. △ Less

Submitted 19 October, 2020; originally announced October 2020.

arXiv:2003.10151 [pdf, other]

GeoGraph: Learning graph-based multi-view object detection with geometric cues end-to-end

Authors: Ahmed Samy Nassar, Stefano D'Aronco, Sébastien Lefèvre, Jan D. Wegner

Abstract: In this paper we propose an end-to-end learnable approach that detects static urban objects from multiple views, re-identifies instances, and finally assigns a geographic position per object. Our method relies on a Graph Neural Network (GNN) to, detect all objects and output their geographic positions given images and approximate camera poses as input. Our GNN simultaneously models relative pose a… ▽ More In this paper we propose an end-to-end learnable approach that detects static urban objects from multiple views, re-identifies instances, and finally assigns a geographic position per object. Our method relies on a Graph Neural Network (GNN) to, detect all objects and output their geographic positions given images and approximate camera poses as input. Our GNN simultaneously models relative pose and image evidence, and is further able to deal with an arbitrary number of input views. Our method is robust to occlusion, with similar appearance of neighboring objects, and severe changes in viewpoints by jointly reasoning about visual image appearance and relative pose. Experimental evaluation on two challenging, large-scale datasets and comparison with state-of-the-art methods show significant and systematic improvements both in accuracy and efficiency, with 2-6% gain in detection and re-ID average precision as well as 8x reduction of training time. △ Less

Submitted 24 March, 2020; v1 submitted 23 March, 2020; originally announced March 2020.

arXiv:1907.10892 [pdf, other]

Simultaneous multi-view instance detection with learned geometric soft-constraints

Authors: Ahmed Samy Nassar, Sebastien Lefevre, Jan D. Wegner

Abstract: We propose to jointly learn multi-view geometry and war** between views of the same object instances for robust cross-view object detection. What makes multi-view object instance detection difficult are strong changes in viewpoint, lighting conditions, high similarity of neighbouring objects, and strong variability in scale. By turning object detection and instance re-identification in different… ▽ More We propose to jointly learn multi-view geometry and war** between views of the same object instances for robust cross-view object detection. What makes multi-view object instance detection difficult are strong changes in viewpoint, lighting conditions, high similarity of neighbouring objects, and strong variability in scale. By turning object detection and instance re-identification in different views into a joint learning task, we are able to incorporate both image appearance and geometric soft constraints into a single, multi-view detection process that is learnable end-to-end. We validate our method on a new, large data set of street-level panoramas of urban objects and show superior performance compared to various baselines. Our contribution is threefold: a large-scale, publicly available data set for multi-view instance detection and re-identification; an annotation tool custom-tailored for multi-view instance detection; and a novel, holistic multi-view instance detection and re-identification method that jointly models geometry and appearance across views. △ Less

Submitted 25 July, 2019; originally announced July 2019.

Comments: Internationcal Conference on Computer Vision 2019 (ICCV 19)

arXiv:1806.04582 [pdf, other]

Reinforcement Learning-based Resource Allocation in Fog RAN for IoT with Heterogeneous Latency Requirements

Authors: Almuthanna T. Nassar, Yasin Yilmaz

Abstract: In light of the quick proliferation of Internet of things (IoT) devices and applications, fog radio access network (Fog-RAN) has been recently proposed for fifth generation (5G) wireless communications to assure the requirements of ultra-reliable low-latency communication (URLLC) for the IoT applications which cannot accommodate large delays. Hence, fog nodes (FNs) are equipped with computing, sig… ▽ More In light of the quick proliferation of Internet of things (IoT) devices and applications, fog radio access network (Fog-RAN) has been recently proposed for fifth generation (5G) wireless communications to assure the requirements of ultra-reliable low-latency communication (URLLC) for the IoT applications which cannot accommodate large delays. Hence, fog nodes (FNs) are equipped with computing, signal processing and storage capabilities to extend the inherent operations and services of the cloud to the edge. We consider the problem of sequentially allocating the FN's limited resources to the IoT applications of heterogeneous latency requirements. For each access request from an IoT user, the FN needs to decide whether to serve it locally utilizing its own resources or to refer it to the cloud to conserve its valuable resources for future users of potentially higher utility to the system (i.e., lower latency requirement). We formulate the Fog-RAN resource allocation problem in the form of a Markov decision process (MDP), and employ several reinforcement learning (RL) methods, namely Q-learning, SARSA, Expected SARSA, and Monte Carlo, for solving the MDP problem by learning the optimum decision-making policies. We verify the performance and adaptivity of the RL methods and compare it with the performance of a fixed-threshold-based algorithm. Extensive simulation results considering 19 IoT environments of heterogeneous latency requirements corroborate that RL methods always achieve the best possible performance regardless of the IoT environment. △ Less

Submitted 15 January, 2019; v1 submitted 27 May, 2018; originally announced June 2018.

arXiv:1705.08101 [pdf, other]

doi 10.1109/JPROC.2017.2684300

Towards seamless multi-view scene analysis from satellite to street-level

Authors: Sébastien Lefèvre, Devis Tuia, Jan Dirk Wegner, Timothée Produit, Ahmed Samy Nassar

Abstract: In this paper, we discuss and review how combined multi-view imagery from satellite to street-level can benefit scene analysis. Numerous works exist that merge information from remote sensing and images acquired from the ground for tasks like land cover map**, object detection, or scene understanding. What makes the combination of overhead and street-level images challenging, is the strongly var… ▽ More In this paper, we discuss and review how combined multi-view imagery from satellite to street-level can benefit scene analysis. Numerous works exist that merge information from remote sensing and images acquired from the ground for tasks like land cover map**, object detection, or scene understanding. What makes the combination of overhead and street-level images challenging, is the strongly varying viewpoint, different scale, illumination, sensor modality and time of acquisition. Direct (dense) matching of images on a per-pixel basis is thus often impossible, and one has to resort to alternative strategies that will be discussed in this paper. We review recent works that attempt to combine images taken from the ground and overhead views for purposes like scene registration, reconstruction, or classification. Three methods that represent the wide range of potential methods and applications (change detection, image orientation, and tree cataloging) are described in detail. We show that cross-fertilization between remote sensing, computer vision and machine learning is very valuable to make the best of geographic data available from Earth Observation sensors and ground imagery. Despite its challenges, we believe that integrating these complementary data sources will lead to major breakthroughs in Big GeoData. △ Less

Submitted 23 May, 2017; originally announced May 2017.

Journal ref: Proceedings of the IEEE, 105, pp. 1884-1899, 2017

Showing 1–17 of 17 results for author: Nassar, A