-
The Impact of Feature Representation on the Accuracy of Photonic Neural Networks
Authors:
Mauricio Gomes de Queiroz,
Paul Jimenez,
Raphael Cardoso,
Mateus Vidaletti Costa,
Mohab Abdalla,
Ian O'Connor,
Alberto Bosio,
Fabio Pavanello
Abstract:
Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network's handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical methodology for analyzing feature combination. Through this methodology, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network's ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding to the accuracy and decision-making strategies of PNNs, particularly in size or power constrained applications.
Submitted 28 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset
Authors:
Abdelrahman Abdallah,
Mahmoud Abdalla,
Mahmoud SalahEldin Kasem,
Mohamed Mahmoud,
Ibrahim Abdelhalim,
Mohamed Elkasaby,
Yasser ElBendary,
Adam Jatowt
Abstract:
In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (https://github.com/Update-For-Integrated-Business-AI/CORU).
Submitted 6 June, 2024;
originally announced June 2024.
-
Video Anomaly Detection in 10 Years: A Survey and Outlook
Authors:
Moshira Abdalla,
Sajid Javed,
Muaz Al Radi,
Anwaar Ulhaq,
Naoufel Werghi
Abstract:
Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. While numerous surveys focus on conventional VAD methods, they often lack depth in exploring specific approaches and emerging trends. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches. A prominent feature of this review is the investigation of core challenges within the VAD paradigms including large-scale datasets, feature extraction, learning methods, loss functions, regularization, and anomaly score prediction. Moreover, this review also investigates the vision language models (VLMs) as potent feature extractors for VAD. VLMs integrate visual data with textual descriptions or spoken language from videos, enabling a nuanced understanding of scenes crucial for anomaly detection. By addressing these challenges and proposing future research directions, this review aims to foster the development of robust and efficient VAD systems leveraging the capabilities of VLMs for enhanced anomaly detection in complex real-world scenarios. This comprehensive analysis seeks to bridge existing knowledge gaps, provide researchers with valuable insights, and contribute to shaping the future of VAD research.
Submitted 30 June, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages
Authors:
Nedjma Ousidhoum,
Shamsuddeen Hassan Muhammad,
Mohamed Abdalla,
Idris Abdulmumin,
Ibrahim Said Ahmad,
Sanchit Ahuja,
Alham Fikri Aji,
Vladimir Araujo,
Meriem Beloucif,
Christine De Kock,
Oumaima Hourrane,
Manish Shrivastava,
Thamar Solorio,
Nirmal Surange,
Krishnapriya Vishnubhotla,
Seid Muhie Yimam,
Saif M. Mohammad
Abstract:
We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.
Submitted 17 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering
Authors:
Abdelrahman Abdallah,
Mahmoud Kasem,
Mahmoud Abdalla,
Mohamed Mahmoud,
Mohamed Elkasaby,
Yasser Elbendary,
Adam Jatowt
Abstract:
In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research at https://github.com/DataScienceUIBK/ArabicaQA.
Submitted 26 March, 2024;
originally announced March 2024.
-
Complexity Assessment of Analog Security Primitives Using the Disentropy of Autocorrelation
Authors:
Paul Jimenez,
Raphael Cardoso,
Maurìcio Gomes de Queiroz,
Mohab Abdalla,
Cédric Marchand,
Xavier Letartre,
Fabio Pavanello
Abstract:
The study of regularity in signals can be of great importance, typically in medicine to analyse electrocardiogram (ECG) or electromyography (EMG) signals, but also in climate studies, finance or security. In this work we focus on security primitives such as Physical Unclonable Functions (PUFs) or Pseudo-Random Number Generators (PRNGs). Such primitives must have a high level of complexity or entropy in their responses to guarantee enough security for their applications. There are several ways of assessing the complexity of their responses, especially in the binary domain. With the development of analog PUFs such as optical (photonic) PUFs, it would be useful to be able to assess their complexity in the analog domain when designing them, for example, before converting analog signals into binary. In this numerical study, we explore the potential of the disentropy of autocorrelation as a measure of complexity for security primitives such as PUFs or PRNGs with analog outputs or responses. We compare this metric to others used to assess regularities in analog signals such as Approximate Entropy (ApEn) and Fuzzy Entropy (FuzEn). We show that the disentropy of autocorrelation is able to differentiate between well-known PRNGs and non-optimised or bad PRNGs in the analog and binary domain with a better contrast than ApEn and FuzEn. Next, we show that the disentropy of autocorrelation is able to detect small patterns injected in PUF responses, and we then apply it to simulations of photonic PUFs.
Submitted 27 February, 2024;
originally announced February 2024.
-
Citation Amnesia: NLP and Other Academic Fields Are in a Citation Age Recession
Authors:
Jan Philip Wahle,
Terry Ruas,
Mohamed Abdalla,
Bela Gipp,
Saif M. Mohammad
Abstract:
This study examines the tendency to cite older work across 20 fields of study over 43 years (1980--2023). We put NLP's propensity to cite older work in the context of these 20 other fields to analyze whether NLP shows similar temporal citation patterns to these other fields over time or whether differences can be observed. Our analysis, based on a dataset of approximately 240 million papers, reveals a broader scientific trend: many fields have markedly declined in citing older works (e.g., psychology, computer science). We term this decline a 'citation age recession', analogous to how economists define periods of reduced economic activity. The trend is strongest in NLP and ML research (-12.8% and -5.5% in citation age from previous peaks). Our results suggest that citing more recent works is not directly driven by the growth in publication rates (-3.4% across fields; -5.2% in humanities; -5.5% in formal sciences) -- even when controlling for an increase in the volume of papers. Our findings raise questions about the scientific community's engagement with past literature, particularly for NLP, and the potential consequences of neglecting older but relevant research. The data and a demo showcasing our results are publicly available.
Submitted 19 February, 2024;
originally announced February 2024.
-
Attacking Large Language Models with Projected Gradient Descent
Authors:
Simon Geisler,
Tom Wollschläger,
M. H. I. Abdalla,
Johannes Gasteiger,
Stephan Günnemann
Abstract:
Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 LLM calls. This high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. To remedy this, we revisit Projected Gradient Descent (PGD) on the continuously relaxed input prompt. Although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. Our PGD for LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.
Submitted 14 February, 2024;
originally announced February 2024.
-
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Authors:
Nedjma Ousidhoum,
Shamsuddeen Hassan Muhammad,
Mohamed Abdalla,
Idris Abdulmumin,
Ibrahim Said Ahmad,
Sanchit Ahuja,
Alham Fikri Aji,
Vladimir Araujo,
Abinew Ali Ayele,
Pavan Baswani,
Meriem Beloucif,
Chris Biemann,
Sofia Bourhim,
Christine De Kock,
Genet Shanko Dekebo,
Oumaima Hourrane,
Gopichand Kanumolu,
Lokesh Madasu,
Samuel Rutunda,
Manish Shrivastava,
Thamar Solorio,
Nirmal Surange,
Hailegnaw Getaneh Tilaye,
Krishnapriya Vishnubhotla,
Genta Winata
, et al. (2 additional authors not shown)
Abstract:
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
Submitted 31 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Collaboration or Corporate Capture? Quantifying NLP's Reliance on Industry Artifacts and Contributions
Authors:
Will Aitken,
Mohamed Abdalla,
Karen Rudie,
Catherine Stinson
Abstract:
Impressive performance of pre-trained models has garnered public attention and made news headlines in recent years. Almost always, these models are produced by or in collaboration with industry. Using them is critical for competing on natural language processing (NLP) benchmarks and correspondingly to stay relevant in NLP research. We surveyed 100 papers published at EMNLP 2022 to determine the degree to which researchers rely on industry models, other artifacts, and contributions to publish in prestigious NLP venues and found that the ratio of their citation is at least three times greater than what would be expected. Our work serves as a scaffold to enable future researchers to more accurately address whether: 1) Collaboration with industry is still collaboration in the absence of an alternative or 2) if NLP inquiry has been captured by the motivations and research direction of private corporations.
Submitted 22 June, 2024; v1 submitted 6 December, 2023;
originally announced December 2023.
-
Benchmarking bias: Expanding clinical AI model card to incorporate bias reporting of social and non-social factors
Authors:
Carolina A. M. Heming,
Mohamed Abdalla,
Monish Ahluwalia,
Linglin Zhang,
Hari Trivedi,
MinJae Woo,
Benjamin Fine,
Judy Wawira Gichoya,
Leo Anthony Celi,
Laleh Seyyed-Kalantari
Abstract:
Clinical AI model reporting cards should be expanded to incorporate a broad bias reporting of both social and non-social factors. Non-social factors consider the role of other factors, such as disease dependent, anatomic, or instrument factors on AI model bias, which are essential to ensure safe deployment.
Submitted 21 November, 2023;
originally announced November 2023.
-
We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields
Authors:
Jan Philip Wahle,
Terry Ruas,
Mohamed Abdalla,
Bela Gipp,
Saif M. Mohammad
Abstract:
Natural Language Processing (NLP) is poised to substantially influence the world. However, significant progress comes hand-in-hand with substantial risks. Addressing them requires broad engagement with various fields of study. Yet, little empirical work examines the state of such engagement (past or current). In this paper, we quantify the degree of influence between 23 fields of study and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP papers to other papers, and ~1.8m citations from other papers to NLP papers. We show that, unlike most fields, the cross-field engagement of NLP, measured by our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in 1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown more insular -- citing increasingly more NLP papers and having fewer papers that act as bridges between fields. NLP citations are dominated by computer science; Less than 8% of NLP citations are to linguistics, and less than 3% are to math and psychology. These findings underscore NLP's urgent need to reflect on its engagement with various fields.
Submitted 1 July, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Balancing exploration and exploitation phases in whale optimization algorithm: an insightful and empirical analysis
Authors:
Aram M. Ahmed,
Tarik A. Rashid,
Bryar A. Hassan,
Jaffer Majidpour,
Kaniaw A. Noori,
Chnoor Maheadeen Rahman,
Mohmad Hussein Abdalla,
Shko M. Qader,
Noor Tayfor,
Naufel B Mohammed
Abstract:
Agents of any metaheuristic algorithm move in two modes, namely exploration and exploitation. Obtaining robust results in any algorithm depends strongly on how these two modes are balanced. The whale optimization algorithm (WOA), a robust and well-recognized metaheuristic algorithm in the literature, proposes a novel scheme to achieve this balance. It has also shown superior results on a wide range of applications. Moreover, in the previous chapter, an equitable and fair performance evaluation of the algorithm was provided. However, to this point, only a comparison of the final results has been considered, which does not explain how these results are obtained. Therefore, this chapter attempts to empirically analyze WOA in terms of its local and global search capabilities, i.e., the ratio of exploration and exploitation phases. To achieve this objective, the dimension-wise diversity measurement is employed, which, at various stages of the optimization process, statistically evaluates the population's convergence and diversity.
Submitted 3 September, 2023;
originally announced October 2023.
-
Equitable and Fair Performance Evaluation of Whale Optimization Algorithm
Authors:
Bryar A. Hassan,
Tarik A. Rashid,
Aram Ahmed,
Shko M. Qader,
Jaffer Majidpour,
Mohmad Hussein Abdalla,
Noor Tayfor,
Hozan K. Hamarashid,
Haval Sidqi,
Kaniaw A. Noori
Abstract:
It is essential that all algorithms are exhaustively, somewhat, and intelligently evaluated. Nonetheless, evaluating the effectiveness of optimization algorithms equitably and fairly is not an easy process for various reasons. Choosing and initializing essential parameters, such as the size issues of the search area for each method and the number of iterations required to reduce the issues, might be particularly challenging. As a result, this chapter aims to contrast the Whale Optimization Algorithm (WOA) with the most recent algorithms on a selected set of benchmark problems with varying benchmark function hardness scores and with comparable initial control parameters, problem dimensions, and search space. When solving a wide range of numerical optimization problems with varying difficulty scores, dimensions, and search areas, the experimental findings suggest that WOA may be statistically superior or inferior to the preceding algorithms with respect to convergence speed, running time, and memory utilization.
Submitted 4 September, 2023;
originally announced October 2023.
-
AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification
Authors:
Abdelrahman Abdallah,
Mahmoud Abdalla,
Mohamed Elkasaby,
Yasser Elbendary,
Adam Jatowt
Abstract:
The extraction of key information from receipts is a complex task that involves the recognition and extraction of text from scanned receipts. This process is crucial as it enables the retrieval of essential content and organizing it into structured documents for easy access and analysis. In this paper, we present AMuRD, a novel multilingual human-annotated dataset specifically designed for information extraction from receipts. This dataset comprises 47,720 samples and addresses the key challenges in information extraction and item classification - the two critical aspects of data analysis in the retail industry. Each sample includes annotations for item names and attributes such as price, brand, and more. This detailed annotation facilitates a comprehensive understanding of each item on the receipt. Furthermore, the dataset provides classification into 44 distinct product categories. This classification feature allows for a more organized and efficient analysis of the items, enhancing the usability of the dataset for various applications. In our study, we evaluated various language model architectures, e.g., by fine-tuning LLaMA models on the AMuRD dataset. Our approach yielded exceptional results, with an F1 score of 97.43% and accuracy of 94.99% in information extraction and classification, and an even higher F1 score of 98.51% and accuracy of 97.06% observed in specific tasks. The dataset and code are publicly accessible for further research at https://github.com/Update-For-Integrated-Business-AI/AMuRD.
Submitted 26 March, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research
Authors:
Mohamed Abdalla,
Jan Philip Wahle,
Terry Ruas,
Aurélie Névéol,
Fanny Ducel,
Saif M. Mohammad,
Karën Fort
Abstract:
Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. Because industry is one of the big players in the field of NLP, together with governments and universities, it is important to track its influence on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore the industry presence in the field since the early 90s. We find that industry presence among NLP authors was steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of the industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.
Submitted 1 July, 2024; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Deep learning for table detection and structure recognition: A survey
Authors:
Mahmoud Kasem,
Abdelrahman Abdallah,
Alexander Berendeyev,
Ebrahem Elkady,
Mahmoud Abdalla,
Mohamed Mahmoud,
Mohamed Hamada,
Daniyar Nurseitov,
Islam Taj-Eddin
Abstract:
Tables are everywhere, from scientific journals, papers, websites, and newspapers all the way to items we buy at the supermarket. Detecting them is thus of utmost importance to automatically understanding the content of a document. The performance of table detection has substantially increased thanks to the rapid development of deep learning networks. The goals of this survey are to provide a profound comprehension of the major developments in the field of Table Detection, offer insight into the different methodologies, and provide a systematic taxonomy of the different approaches. Furthermore, we provide an analysis of both classic and new applications in the field. Lastly, the datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature. Finally, we go over the architecture of utilizing various object detection and table structure recognition methods to create an effective and efficient system, as well as a set of development trends to keep up with state-of-the-art algorithms and future research. We have also set up a public GitHub repository where we will be updating the most recent publications, open data, and source code. The GitHub repository is available at https://github.com/abdoelsayed2016/table-detection-structure-recognition.
Submitted 15 November, 2022;
originally announced November 2022.
-
Hitless memory-reconfigurable photonic reservoir computing architecture
Authors:
Mohab Abdalla,
Clément Zrounba,
Raphael Cardoso,
Paul Jimenez,
Guanghui Ren,
Andreas Boes,
Arnan Mitchell,
Alberto Bosio,
Ian O'Connor,
Fabio Pavanello
Abstract:
Reservoir computing is an analog bio-inspired computation model for efficiently processing time-dependent signals, the photonic implementations of which promise a combination of massive parallel information processing, low power consumption, and high speed operation. However, most implementations, especially for the case of time-delay reservoir computing (TDRC), require signal attenuation in the reservoir to achieve the desired system dynamics for a specific task, often resulting in large amounts of power being coupled outside of the system. We propose a novel TDRC architecture based on an asymmetric Mach-Zehnder interferometer (MZI) integrated in a resonant cavity which allows the memory capacity of the system to be tuned without the need for an optical attenuator block. Furthermore, this can be leveraged to find the optimal value for the specific components of the total memory capacity metric. We demonstrate this approach on the temporal bitwise XOR task and conclude that this way of memory capacity reconfiguration allows optimal performance to be achieved for memory-specific tasks.
Submitted 17 May, 2023; v1 submitted 13 July, 2022;
originally announced July 2022.
-
What Makes Sentences Semantically Related: A Textual Relatedness Dataset and Empirical Study
Authors:
Mohamed Abdalla,
Krishnapriya Vishnubhotla,
Saif M. Mohammad
Abstract:
The degree of semantic relatedness of two units of language has long been considered fundamental to understanding meaning. Additionally, automatically determining relatedness has many applications such as question answering and summarization. However, prior NLP work has largely focused on semantic similarity, a subset of relatedness, because of a lack of relatedness datasets. In this paper, we introduce a dataset for Semantic Textual Relatedness, STR-2022, that has 5,500 English sentence pairs manually annotated using a comparative annotation framework, resulting in fine-grained scores. We show that human intuition regarding relatedness of sentence pairs is highly reliable, with a repeat annotation correlation of 0.84. We use the dataset to explore questions on what makes sentences semantically related. We also show the utility of STR-2022 for evaluating automatic methods of sentence representation and for various downstream NLP tasks.
Our dataset, data statement, and annotation questionnaire can be found at: https://doi.org/10.5281/zenodo.7599667
Submitted 20 March, 2023; v1 submitted 10 October, 2021;
originally announced October 2021.
-
Examining the rhetorical capacities of neural language models
Authors:
Zining Zhu,
Chuer Pan,
Mohamed Abdalla,
Frank Rudzicz
Abstract:
Recently, neural language models (LMs) have demonstrated impressive abilities in generating high-quality discourse. While many recent papers have analyzed the syntactic aspects encoded in LMs, there has been no analysis to date of the inter-sentential, rhetorical knowledge. In this paper, we propose a method that quantitatively evaluates the rhetorical capacities of neural LMs. We examine the capacities of neural LMs understanding the rhetoric of discourse by evaluating their abilities to encode a set of linguistic features derived from Rhetorical Structure Theory (RST). Our experiments show that BERT-based LMs outperform other Transformer LMs, revealing the richer discourse knowledge in their intermediate layer representations. In addition, GPT-2 and XLNet apparently encode less rhetorical knowledge, and we suggest an explanation drawing from linguistic philosophy. Our method shows an avenue towards quantifying the rhetorical capacities of neural LMs.
Submitted 4 October, 2020; v1 submitted 30 September, 2020;
originally announced October 2020.
-
The Grey Hoodie Project: Big Tobacco, Big Tech, and the threat on academic integrity
Authors:
Mohamed Abdalla,
Moustafa Abdalla
Abstract:
As governmental bodies rely on academics' expert advice to shape policy regarding Artificial Intelligence, it is important that these academics not have conflicts of interests that may cloud or bias their judgement. Our work explores how Big Tech can actively distort the academic landscape to suit its needs. By comparing the well-studied actions of another industry (Big Tobacco) to the current actions of Big Tech we see similar strategies employed by both industries. These strategies enable either industry to sway and influence academic and public discourse. We examine the funding of academic research as a tool used by Big Tech to put forward a socially responsible public image, influence events hosted by and decisions made by funded universities, influence the research questions and plans of individual scientists, and discover receptive academics who can be leveraged. We demonstrate how Big Tech can affect academia from the institutional level down to individual researchers. Thus, we believe that it is vital, particularly for universities and other institutions of higher learning, to discuss the appropriateness and the tradeoffs of accepting funding from Big Tech, and what limitations or conditions should be put in place.
Submitted 27 April, 2021; v1 submitted 28 September, 2020;
originally announced September 2020.
-
Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings
Authors:
Haoran Zhang,
Amy X. Lu,
Mohamed Abdalla,
Matthew McDermott,
Marzyeh Ghassemi
Abstract:
In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset, and quantify potential disparities using two approaches. First, we identify dangerous latent relationships that are captured by the contextual word embeddings using a fill-in-the-blank method with text from real clinical notes and a log probability bias score quantification. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks that include detection of acute and chronic conditions. We find that classifiers trained from BERT representations exhibit statistically significant differences in performance, often favoring the majority group with regards to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.
Submitted 11 March, 2020;
originally announced March 2020.
-
Cross-Lingual Sentiment Analysis Without (Good) Translation
Authors:
Mohamed Abdalla,
Graeme Hirst
Abstract:
Current approaches to cross-lingual sentiment analysis try to leverage the wealth of labeled English data using bilingual lexicons, bilingual vector space embeddings, or machine translation systems. Here we show that it is possible to use a single linear transformation, with as few as 2000 word pairs, to capture fine-grained sentiment relationships between words in a cross-lingual setting. We apply these cross-lingual sentiment models to a diverse set of tasks to demonstrate their functionality in a non-English context. By effectively leveraging English sentiment knowledge without the need for accurate translation, we can analyze and extract features from other languages with scarce data at a very low cost, thus making sentiment and related analyses for many languages inexpensive.
Submitted 24 October, 2017; v1 submitted 5 July, 2017;
originally announced July 2017.
-
Arabic Character Segmentation Using Projection Based Approach with Profile's Amplitude Filter
Authors:
Mahmoud A. A. Mousa,
Mohammed S. Sayed,
Mahmoud I. Abdalla
Abstract:
Arabic is one of the languages that present special challenges to optical character recognition (OCR). The main challenge in Arabic is that it is mostly cursive. Therefore, a segmentation process must be carried out to determine where each character begins and where it ends. This step is essential for character recognition. This paper presents an Arabic character segmentation algorithm. The proposed algorithm uses projection-based approach concepts to separate lines, words, and characters. This is done using a profile's amplitude filter and a simple edge tool to find character separations. Our algorithm shows promising performance when applied to different machine-printed documents with different Arabic fonts.
Submitted 3 July, 2017;
originally announced July 2017.
-
Treatment the Effects of Studio Wall Resonance and Coincidence Phenomena for Recording Noisy Speech Via FPGA Digital Filter
Authors:
Mahmoud I. A. Abdalla
Abstract:
This work introduces an economical solution for the problems of sound insulation of recording studios. Sound insulation at the wall resonance frequency is weak. Instead of acoustical treatment, a digital filter is used to eliminate the effects of wall resonance and coincidence phenomena on the recording of speech. The sound insulation of the studio is measured to calculate the wall resonance frequency and the coincidence frequency. The pole/zero placement technique is used to calculate the IIR filter coefficients. The digital filter is designed, simulated and implemented. The proposed system is used to treat these problems and it is shown to be effective in recording noisy speech. In this work, digital signal processing is used instead of acoustic treatment to eliminate the effect of noise at the studio wall resonance. This technique is cheap and effective in canceling the noise at the desired frequencies. A Field Programmable Gate Array (FPGA) is used for hardware implementation of the proposed filter structure, which provides a fast and cheap solution for processing real-time audio signals. The implementation is carried out using a Spartan chip from Xilinx, achieving higher performance than commercially available software solutions.
Submitted 4 June, 2010;
originally announced June 2010.
-
Wavelet-Based Mel-Frequency Cepstral Coefficients for Speaker Identification using Hidden Markov Models
Authors:
Mahmoud I. Abdalla,
Hanaa S. Ali
Abstract:
To improve the performance of speaker identification systems, an effective and robust method is proposed to extract speech features, capable of operating in noisy environments. Based on the time-frequency multi-resolution property of the wavelet transform, the input speech signal is decomposed into various frequency channels. To capture the characteristics of the signal, the Mel-Frequency Cepstral Coefficients (MFCCs) of the wavelet channels are calculated. Hidden Markov Models (HMMs) were used for the recognition stage as they give better recognition for the speaker's features than Dynamic Time Warping (DTW). Comparison of the proposed approach with the conventional MFCC feature extraction method shows that the proposed method not only effectively reduces the influence of noise, but also improves recognition. A recognition rate of 99.3% was obtained using the proposed feature extraction technique compared to 98.7% using the MFCCs. When the test patterns were corrupted by additive white Gaussian noise with 20 dB S/N ratio, the recognition rate was 97.3% using the proposed method compared to 93.3% using the MFCCs.
Submitted 29 March, 2010;
originally announced March 2010.
-
Influence of Micro-Cantilever Geometry and Gap on Pull-in Voltage
Authors:
W. Faris,
H. Mohammed,
M. M. Abdalla,
C. -H. Ling
Abstract:
In this paper, we study the behaviour of a microcantilever beam under electrostatic actuation using the finite difference method. This problem has many applications in MEMS-based devices such as accelerometers, switches and others. We formulate the problem of a cantilever beam with a proof mass at its end and carry out the finite difference solution. We study the effects of length, width, and gap size on the pull-in voltage using data that are available in the literature. Also, the stability limit is compared with the single-degree-of-freedom model commonly used in the earlier literature as an approximation to calculate the pull-in voltage.
Submitted 21 November, 2007;
originally announced November 2007.
-
Share and Disperse: How to Resist Against Aggregator Compromises in Sensor Networks
Authors:
Thomas Claveirole,
Marcelo Dias de Amorim,
Michel Abdalla,
Yannis Viniotis
Abstract:
A common approach to overcome the limited nature of sensor networks is to aggregate data at intermediate nodes. A challenging issue in this context is to guarantee end-to-end security mainly because sensor networks are extremely vulnerable to node compromises. In order to secure data aggregation, in this paper we propose three schemes that rely on multipath routing. The first one guarantees data confidentiality through secret sharing, while the second and third ones provide data availability through information dispersal. Based on qualitative analysis and implementation, we show that, by applying these schemes, a sensor network can achieve data confidentiality, authenticity, and protection against denial of service attacks even in the presence of multiple compromised nodes.
Submitted 13 October, 2006;
originally announced October 2006.