Search | arXiv e-print repository

Phishing Website Detection Using a Combined Model of ANN and LSTM

Authors: Muhammad Shoaib Farooq, Hina Jabbar

Abstract: In this digital era, our lives highly depend on the internet and worldwide technology. Wide usage of technology and platforms of communication makes our lives better and easier. But on the other side it carries out some security issues and cruel activities, phishing is one activity of these cruel activities. It is a type of cybercrime, which has the purpose of stealing the personal information of the computer user, and enterprises, which carry out fake websites that are the copy of the original websites. The attackers used personal information like account IDs, passwords, and usernames for the purpose of some fraudulent activities against the user of the computer. To overcome this problem researchers focused on the machine learning and deep learning approaches. In our study, we are going to use machine learning and deep learning models to identify the fake web pages on the secondary dataset.

Submitted 24 March, 2024; originally announced April 2024.

Comments: Pages 9, Figures 5

arXiv:2312.10188 [pdf, other]

WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data

Authors: Maurice Weber, Carlo Siebenschuh, Rory Butler, Anton Alexandrov, Valdemar Thanner, Georgios Tsolakis, Haris Jabbar, Ian Foster, Bo Li, Rick Stevens, Ce Zhang

Abstract: We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches proved effective for visual question answering or layout segmentation. However, the interplay of text, tables, and visuals remains challenging for a variety of document understanding tasks. In particular, many models fail to generalize well to diverse domains and new languages due to insufficient availability of training data. WordScape addresses these limitations. Our automatic annotation pipeline parses the Open XML structure of Word documents obtained from the web, jointly providing layout-annotated document images and their textual representations. In turn, WordScape offers unique properties as it (1) leverages the ubiquity of the Word file format on the internet, (2) is readily accessible through the Common Crawl web corpus, (3) is adaptive to domain-specific documents, and (4) offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text. Together with the pipeline, we will additionally release 9.5M urls to word documents which can be processed using WordScape to create a dataset of over 40M pages. Finally, we investigate the quality of text and layout annotations extracted by WordScape, assess the impact on document understanding benchmarks, and demonstrate that manual labeling costs can be substantially reduced.

Submitted 15 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023 Datasets and Benchmarks

arXiv:2307.07262 [pdf, other]

MorphPiece: A Linguistic Tokenizer for Large Language Models

Authors: Haris Jabbar

Abstract: Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration to the linguistic features. I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks, compared to the OpenAI GPT-2 model. Specifically I evaluated MorphGPT on language modeling tasks, zero-shot performance on GLUE Benchmark with various prompt templates, massive text embedding benchmark (MTEB) for supervised and unsupervised performance, and lastly with another morphological tokenization scheme (FLOTA, Hoffmann et al., 2022) and find that the model trained on MorphPiece outperforms GPT-2 on most evaluations, at times with considerable margin, despite being trained for about half the training iterations.

Submitted 3 February, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

Comments: Manuscript under review. Patent pending

arXiv:2204.12225 [pdf, other]

Flow-Adapter Architecture for Unsupervised Machine Translation

Authors: Yihong Liu, Haris Jabbar, Hinrich Schütze

Abstract: In this work, we propose a flow-adapter architecture for unsupervised NMT. It leverages normalizing flows to explicitly model the distributions of sentence-level latent representations, which are subsequently used in conjunction with the attention mechanism for the translation task. The primary novelties of our model are: (a) capturing language-specific sentence representations separately for each language using normalizing flows and (b) using a simple transformation of these latent representations for translating from one language to another. This architecture allows for unsupervised training of each language independently. While there is prior work on latent variables for supervised MT, to the best of our knowledge, this is the first work that uses latent variables and normalizing flows for unsupervised MT. We obtain competitive results on several unsupervised MT benchmarks.

Submitted 26 April, 2022; originally announced April 2022.

Comments: ACL 2022

arXiv:2203.11764 [pdf, other]

Listening to Affected Communities to Define Extreme Speech: Dataset and Experiments

Authors: Antonis Maronikolakis, Axel Wisiorek, Leah Nann, Haris Jabbar, Sahana Udupa, Hinrich Schuetze

Abstract: Building on current work on multilingual hate speech (e.g., Ousidhoum et al. (2019)) and hate speech reduction (e.g., Sap et al. (2020)), we present XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages from Brazil, Germany, India and Kenya. The key novelty is that we directly involve the affected communities in collecting and annotating the data - as opposed to giving companies and governments control over defining and combatting hate speech. This inclusive approach results in datasets more representative of actually occurring online speech and is likely to facilitate the removal of the social media content that marginalized communities view as causing the most harm. Based on XTREMESPEECH, we establish novel tasks with accompanying baselines, provide evidence that cross-country training is generally not feasible due to cultural differences between countries and perform an interpretability analysis of BERT's predictions.

Submitted 22 March, 2022; originally announced March 2022.

Comments: Accepted to ACL 2022 Findings

arXiv:1712.07450 [pdf]

doi 10.1021/acs.nanolett.8b00570

Disentangling magnetic hardening and molecular spin chain contributions to exchange bias in ferromagnet/molecule bilayers

Authors: Samy Boukari, Hashim Jabbar, Filip Schleicher, Manuel Gruber, Jacek Arabski, Victor Da Costa, Guy Schmerber, Prashanth Rengasamy, Bertrand Vileno, Wolfgang Weber, Martin Bowen, Eric Beaurepaire

Abstract: We performed SQUID and FMR magnetometry experiments to clarify the relationship between two reported magnetic exchange effects arising from interfacial spin-polarized charge transfer within ferromagnetic metal (FM)/molecule bilayers: the magnetic hardening effect, and spinterface-stabilized molecular spin chains. To disentangle these effects, both of which can affect the FM magnetization reversal, we tuned the metal phthalocyanine molecule central site's magnetic moment to selectively enhance or suppress the formation of spin chains within the molecular film. We find that both effects are distinct, and additive. In the process, we 1) extended the list of FM/molecule candidate pairs that are known to generate magnetic exchange effects, 2) experimentally confirmed the predicted increase in anisotropy upon molecular adsorption; and 3) showed that spin chains within the molecular film can enhance magnetic exchange. This magnetic ordering within the organic layer implies a structural ordering. Thus, by disentangling the magnetic hardening and exchange bias contributions, our results confirm, as an echo to progress regarding inorganic spintronic tunnelling, that the milestone of spintronic tunnelling across structurally ordered organic barriers has been reached through previous magnetotransport experiments. This paves the way for solid-state devices studies that exploit the quantum physical properties of spin chains, notably through external stimuli.

Submitted 20 December, 2017; originally announced December 2017.

Comments: None

Showing 1–6 of 6 results for author: Jabbar, H