-
Phishing Website Detection Using a Combined Model of ANN and LSTM
Authors:
Muhammad Shoaib Farooq,
Hina Jabbar
Abstract:
In this digital era, our lives depend heavily on the internet and worldwide technology. Wide usage of technology and communication platforms makes our lives better and easier. On the other hand, it brings security issues and malicious activities, one of which is phishing. Phishing is a type of cybercrime that aims to steal the personal information of computer users and enterprises by means of fake websites that copy the original websites. The attackers use personal information such as account IDs, passwords, and usernames for fraudulent activities against the computer user. To overcome this problem, researchers have focused on machine learning and deep learning approaches. In our study, we use machine learning and deep learning models to identify fake web pages on a secondary dataset.
Submitted 24 March, 2024;
originally announced April 2024.
-
WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data
Authors:
Maurice Weber,
Carlo Siebenschuh,
Rory Butler,
Anton Alexandrov,
Valdemar Thanner,
Georgios Tsolakis,
Haris Jabbar,
Ian Foster,
Bo Li,
Rick Stevens,
Ce Zhang
Abstract:
We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches proved effective for visual question answering or layout segmentation. However, the interplay of text, tables, and visuals remains challenging for a variety of document understanding tasks. In particular, many models fail to generalize well to diverse domains and new languages due to insufficient availability of training data. WordScape addresses these limitations. Our automatic annotation pipeline parses the Open XML structure of Word documents obtained from the web, jointly providing layout-annotated document images and their textual representations. In turn, WordScape offers unique properties as it (1) leverages the ubiquity of the Word file format on the internet, (2) is readily accessible through the Common Crawl web corpus, (3) is adaptive to domain-specific documents, and (4) offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text. Together with the pipeline, we will additionally release 9.5M URLs to Word documents, which can be processed using WordScape to create a dataset of over 40M pages. Finally, we investigate the quality of text and layout annotations extracted by WordScape, assess the impact on document understanding benchmarks, and demonstrate that manual labeling costs can be substantially reduced.
Submitted 15 December, 2023;
originally announced December 2023.
-
MorphPiece: A Linguistic Tokenizer for Large Language Models
Authors:
Haris Jabbar
Abstract:
Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration of linguistic features. I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks, compared to the OpenAI GPT-2 model. Specifically, I evaluated MorphGPT on language modeling tasks, zero-shot performance on the GLUE benchmark with various prompt templates, the Massive Text Embedding Benchmark (MTEB) for supervised and unsupervised performance, and lastly against another morphological tokenization scheme (FLOTA, Hoffmann et al., 2022), and find that the model trained on MorphPiece outperforms GPT-2 on most evaluations, at times by a considerable margin, despite being trained for about half the training iterations.
Submitted 3 February, 2024; v1 submitted 14 July, 2023;
originally announced July 2023.
-
Flow-Adapter Architecture for Unsupervised Machine Translation
Authors:
Yihong Liu,
Haris Jabbar,
Hinrich Schütze
Abstract:
In this work, we propose a flow-adapter architecture for unsupervised NMT. It leverages normalizing flows to explicitly model the distributions of sentence-level latent representations, which are subsequently used in conjunction with the attention mechanism for the translation task. The primary novelties of our model are: (a) capturing language-specific sentence representations separately for each language using normalizing flows and (b) using a simple transformation of these latent representations for translating from one language to another. This architecture allows for unsupervised training of each language independently. While there is prior work on latent variables for supervised MT, to the best of our knowledge, this is the first work that uses latent variables and normalizing flows for unsupervised MT. We obtain competitive results on several unsupervised MT benchmarks.
Submitted 26 April, 2022;
originally announced April 2022.
-
Listening to Affected Communities to Define Extreme Speech: Dataset and Experiments
Authors:
Antonis Maronikolakis,
Axel Wisiorek,
Leah Nann,
Haris Jabbar,
Sahana Udupa,
Hinrich Schuetze
Abstract:
Building on current work on multilingual hate speech (e.g., Ousidhoum et al. (2019)) and hate speech reduction (e.g., Sap et al. (2020)), we present XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages from Brazil, Germany, India and Kenya. The key novelty is that we directly involve the affected communities in collecting and annotating the data - as opposed to giving companies and governments control over defining and combatting hate speech. This inclusive approach results in datasets more representative of actually occurring online speech and is likely to facilitate the removal of the social media content that marginalized communities view as causing the most harm. Based on XTREMESPEECH, we establish novel tasks with accompanying baselines, provide evidence that cross-country training is generally not feasible due to cultural differences between countries and perform an interpretability analysis of BERT's predictions.
Submitted 22 March, 2022;
originally announced March 2022.
-
Disentangling magnetic hardening and molecular spin chain contributions to exchange bias in ferromagnet/molecule bilayers
Authors:
Samy Boukari,
Hashim Jabbar,
Filip Schleicher,
Manuel Gruber,
Jacek Arabski,
Victor Da Costa,
Guy Schmerber,
Prashanth Rengasamy,
Bertrand Vileno,
Wolfgang Weber,
Martin Bowen,
Eric Beaurepaire
Abstract:
We performed SQUID and FMR magnetometry experiments to clarify the relationship between two reported magnetic exchange effects arising from interfacial spin-polarized charge transfer within ferromagnetic metal (FM)/molecule bilayers: the magnetic hardening effect, and spinterface-stabilized molecular spin chains. To disentangle these effects, both of which can affect the FM magnetization reversal, we tuned the metal phthalocyanine molecule central site's magnetic moment to selectively enhance or suppress the formation of spin chains within the molecular film. We find that both effects are distinct, and additive. In the process, we 1) extended the list of FM/molecule candidate pairs that are known to generate magnetic exchange effects, 2) experimentally confirmed the predicted increase in anisotropy upon molecular adsorption, and 3) showed that spin chains within the molecular film can enhance magnetic exchange. This magnetic ordering within the organic layer implies a structural ordering. Thus, by disentangling the magnetic hardening and exchange bias contributions, our results confirm, as an echo to progress regarding inorganic spintronic tunnelling, that the milestone of spintronic tunnelling across structurally ordered organic barriers has been reached through previous magnetotransport experiments. This paves the way for solid-state device studies that exploit the quantum physical properties of spin chains, notably through external stimuli.
Submitted 20 December, 2017;
originally announced December 2017.