-
Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting
Authors:
Omar Hamed,
Souhail Bakkali,
Marie-Francine Moens,
Matthew Blaschko,
Jordy Van Landeghem
Abstract:
This work addresses the need for a balanced approach between performance and efficiency in scalable production environments for visually-rich document understanding (VDU) tasks. Currently, there is a reliance on large document foundation models that offer advanced capabilities but come with a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types and placements. Our goal is to achieve a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and showcase an improved performance-efficiency trade-off. Our multimodal EE design reduces latency by over 20% while fully retaining the baseline accuracy. This research represents the first exploration of multimodal EE design within the VDU community, and it also highlights the effectiveness of calibration in improving confidence scores for exiting at different layers. Overall, our findings contribute to practical VDU applications by enhancing both performance and efficiency.
Submitted 21 May, 2024;
originally announced May 2024.
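As an illustration of the general early-exit idea mentioned in the abstract above, the following is a minimal sketch of confidence-thresholded exiting. It is not the authors' method: the function names, the softmax-confidence criterion, and the `threshold` value are assumptions chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """Return (predicted_class, exit_index).

    exit_logits: one logit vector per exit head, ordered from the
    shallowest exit to the final layer (hypothetical input format).
    threshold: assumed confidence level above which we stop early.
    """
    if not exit_logits:
        raise ValueError("exit_logits must not be empty")
    last = len(exit_logits) - 1
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        # Exit as soon as one head is confident enough; the final
        # head always answers, so a prediction is guaranteed.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i
```

In a real model the later exits would only be computed if the earlier ones declined to answer, which is where the latency saving comes from; here all logits are passed in up front purely to keep the sketch self-contained.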
-
LLMChain: Blockchain-based Reputation System for Sharing and Evaluating Large Language Models
Authors:
Mouhamed Amine Bouchiha,
Quentin Telnoff,
Souhail Bakkali,
Ronan Champagnat,
Mourad Rabah,
Mickaël Coustaty,
Yacine Ghamri-Doudane
Abstract:
Large Language Models (LLMs) have seen rapid growth in their capabilities for language understanding, generation, and reasoning, along with newly emerging challenges. Despite their remarkable performance in natural language processing-based applications, LLMs are susceptible to undesirable and erratic behaviors, including hallucinations, unreliable reasoning, and the generation of harmful content. These flawed behaviors undermine trust in LLMs and pose significant hurdles to their adoption in real-world applications, such as legal assistance and medical diagnosis, where precision, reliability, and ethical considerations are paramount. They can also lead to user dissatisfaction, which is currently inadequately assessed and captured. Therefore, to effectively and transparently assess users' satisfaction and trust in their interactions with LLMs, we design and develop LLMChain, a decentralized blockchain-based reputation system that combines automatic evaluation with human feedback to assign contextual reputation scores that accurately reflect each LLM's behavior. LLMChain not only helps users and entities identify the most trustworthy LLM for their specific needs, but also provides LLM developers with valuable information to refine and improve their models. To our knowledge, this is the first time that a blockchain-based distributed framework for sharing and evaluating LLMs has been introduced. Implemented using emerging tools, LLMChain is evaluated across two benchmark datasets, showcasing its effectiveness and scalability in assessing seven different LLMs.
Submitted 3 May, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
IDTrust: Deep Identity Document Quality Detection with Bandpass Filtering
Authors:
Musab Al-Ghadi,
Joris Voerman,
Souhail Bakkali,
Mickaël Coustaty,
Nicolas Sidere,
Xavier St-Georges
Abstract:
The increasing use of digital technologies and mobile-based registration procedures highlights the vital role of personal identity documents (IDs) in verifying users and safeguarding sensitive information. However, the rise in counterfeit ID production poses a significant challenge, necessitating the development of reliable and efficient automated verification methods. This paper introduces IDTrust, a deep-learning framework for assessing the quality of IDs. IDTrust is a system that enhances the quality of identification documents by using a deep learning-based approach. This method eliminates the need for relying on original document patterns for quality checks and pre-processing steps for alignment. As a result, it offers significant improvements in terms of dataset applicability. By utilizing a bandpass filtering-based method, the system aims to effectively detect and differentiate ID quality. Comprehensive experiments on the MIDV-2020 and L3i-ID datasets identify optimal parameters, significantly improving discrimination performance and effectively distinguishing between original and scanned ID documents.
Submitted 26 June, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
-
TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language
Authors:
Souhail Bakkali,
Sanket Biswas,
Zuheng Ming,
Mickael Coustaty,
Marçal Rusiñol,
Oriol Ramos Terrades,
Josep Lladós
Abstract:
The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a "pre-train-then-fine-tune" paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a "closer-to-real" industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches.
Submitted 11 September, 2023;
originally announced September 2023.
-
EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification
Authors:
Souhail Bakkali,
Ziheng Ming,
Mickael Coustaty,
Marçal Rusiñol
Abstract:
In the recent past, complex deep neural networks have received huge interest in various document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images has encountered the problems of low inter-class discrimination and high intra-class structural variation between categories. In parallel, text-level understanding jointly learned with the corresponding visual properties within a given document image has considerably improved the classification performance in terms of accuracy. In this paper, we design a self-attention-based fusion module that serves as a block in our ensemble trainable network. It allows the network to learn the discriminant features of the image and text modalities simultaneously throughout the training stage. Besides, we encourage mutual learning by transferring the positive knowledge between image and text modalities during the training stage. This constraint is realized by adding a truncated Kullback-Leibler divergence loss, Tr-KLD-Reg, as a new regularization term to the conventional supervised setting. To the best of our knowledge, this is the first work to leverage a mutual learning approach along with a self-attention-based fusion module to perform document image classification. The experimental results illustrate the effectiveness of our approach in terms of accuracy in both the single-modal and multi-modal settings. Thus, the proposed ensemble self-attention-based mutual learning model outperforms the state-of-the-art classification results on the benchmark RVL-CDIP and Tobacco-3482 datasets.
Submitted 11 May, 2023;
originally announced May 2023.
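To make the mutual-learning regularizer in the abstract above more concrete, here is a rough sketch of a symmetric Kullback-Leibler penalty between the class distributions predicted by an image branch and a text branch. The abstract does not specify how Tr-KLD-Reg is truncated, so this shows the plain, untruncated divergence only; the function names and the `eps` smoothing constant are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists.

    eps is an assumed smoothing constant that avoids log(0).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_learning_penalty(image_probs, text_probs):
    """Symmetric KL penalty between the two modality predictions.

    This is plain KL in both directions, NOT the truncated variant
    named in the abstract, whose exact form is not given there.
    """
    return kl_divergence(image_probs, text_probs) + kl_divergence(text_probs, image_probs)
```

Such a term would typically be added, with some weight, to the usual supervised cross-entropy loss so that each branch is nudged toward the other's predictions.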
-
VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
Authors:
Souhail Bakkali,
Zuheng Ming,
Mickael Coustaty,
Marçal Rusiñol,
Oriol Ramos Terrades
Abstract:
Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features as a prior for a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and the generality of our model on low-scale and large-scale datasets.
Submitted 11 May, 2023; v1 submitted 24 May, 2022;
originally announced May 2022.
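The idea of pulling positive pairs together while pushing negatives apart, described in the abstract above, can be sketched with a generic InfoNCE-style contrastive loss. This is a standard formulation used here for illustration and is not claimed to be the paper's objective; the function names, the use of cosine similarity, and the `temperature` value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, text_embs, temperature=0.1):
    """Average image-to-text InfoNCE loss over a batch.

    image_embs[i] and text_embs[i] are assumed to be a positive pair;
    every other text embedding in the batch acts as a negative.
    """
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        sims = [cosine(image_embs[i], text_embs[j]) / temperature for j in range(n)]
        # log-sum-exp with max subtraction for numerical stability
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_denom - sims[i]
    return loss / n
```

The loss is small when each image embedding is most similar to its own text embedding and grows when the pairing is scrambled.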
-
Face Detection in Camera Captured Images of Identity Documents under Challenging Conditions
Authors:
Souhail Bakkali,
Zuheng Ming,
Muhammad Muzzamil Luqman,
Jean-Christophe Burie
Abstract:
Benefiting from the advance of deep convolutional neural network approaches (CNNs), many face detection algorithms have achieved state-of-the-art performance in terms of accuracy and very high speed in unconstrained applications. However, due to the lack of public datasets and due to the variation of the orientation of face images, the complex background and lighting, defocus and the varying illumination of camera captured images, face detection on identity documents under unconstrained environments has not been sufficiently studied. To address this problem more efficiently, we survey three state-of-the-art face detection methods based on general images, i.e. Cascade-CNN, MTCNN and PCN, for face detection in camera captured images of identity documents, given different image quality assessments. For that, the MIDV-500 dataset, which is the largest and most challenging dataset for identity documents, is used to evaluate the three methods. The evaluation results show the performance and the limitations of the current methods for face detection on identity documents in complex in-the-wild environments. These results show that the face detection task in camera captured images of identity documents is challenging, leaving room for improvement in future work.
Submitted 8 November, 2019;
originally announced November 2019.
-
The application of the competency-based approach to assess the training and employment adequacy problem
Authors:
Zineb Ait Haddouchane,
Soumia Bakkali,
Souad Ajana,
Karim Gassemi
Abstract:
This review paper fits in the context of the adequate matching of training to employment, which is one of the main challenges that universities around the world strive to meet. In higher education, the revision of curricula necessitates a return to the skills required by the labor market in order to train skilled workers.
In this research, we started with the presentation of the conceptual framework. Then we reviewed different schools of thought that have discussed the problem of matching training to jobs from various perspectives. We proceeded to choose some studies that have attempted to remedy this problem by adopting the competency-based approach that involves the referential line. The main characteristic of this approach is that it aims to achieve the match between training and employment, which makes it a relevant solution for this problem. We scrutinized the selected studies, presenting their objectives, methodologies and results, and we provided our own analysis. Then, we focused on the Moroccan context through observations and studies already conducted. Finally, we introduced the research problem of our future project.
Submitted 11 April, 2017;
originally announced April 2017.