Unmasking the Imposters: In-Domain Detection of Human vs. Machine-Generated Tweets

Abstract: The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their misuse on social media platforms. We present a methodology using Twitter datasets to examine the generative capabilities of four LLMs: Llama 3, Mistral, Qwen2, and GPT4o. We evaluate 7B and 8B parameter base-instruction models of the three open-source LLMs and validate the impact of further fine-tuning and "uncensored" versions. Our findings show that "uncensored" models with additional in-domain fine-tuning dramatically reduce the effectiveness of automated detection methods. This study addresses a gap by exploring smaller open-source models and the effects of "uncensoring," providing insights into how fine-tuning and content moderation influence machine-generated text detection.

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2405.03920 [pdf, other]

A Roadmap for Multilingual, Multimodal Domain Independent Deception Detection

Authors: Dainis Boumber, Rakesh M. Verma, Fatima Zahra Qachfar

Abstract: Deception, a prevalent aspect of human communication, has undergone a significant transformation in the digital age. With the globalization of online interactions, individuals are communicating in multiple languages and mixing languages on social media, with varied data becoming available in each language and dialect. At the same time, the techniques for detecting deception are similar across the board. Recent studies have shown the possibility of the existence of universal linguistic cues to deception across domains within the English language; however, the existence of such cues in other languages remains unknown. Furthermore, the practical task of deception detection in low-resource languages is not a well-studied problem due to the lack of labeled data. Another dimension of deception is multimodality. For example, a picture with an altered caption in fake news or disinformation may exist. This paper calls for a comprehensive investigation into the complexities of deceptive language across linguistic boundaries and modalities within the realm of computer security and natural language processing and the possibility of using multilingual transformer models and labeled data in various languages to universally address the task of deception detection.

Submitted 6 May, 2024; originally announced May 2024.

Comments: 6 pages, 1 figure, shorter version in SIAM International Conference on Data Mining (SDM) 2024

ACM Class: I.2.6; I.2.7; I.2.10; K.4.4

Journal ref: Proc. SDM 2024, 396-399

arXiv:2402.03171 [pdf, other]

Homograph Attacks on Maghreb Sentiment Analyzers

Authors: Fatima Zahra Qachfar, Rakesh M. Verma

Abstract: We examine the impact of homograph attacks on the Sentiment Analysis (SA) task of different Arabic dialects from the Maghreb North-African countries. Homograph attacks result in a 65.3% decrease in transformer classification from an F1-score of 0.95 to 0.33 when data is written in "Arabizi". The goal of this study is to highlight LLMs' weaknesses and to prioritize ethical and responsible Machine Learning.

Submitted 5 February, 2024; originally announced February 2024.

Comments: NAML, North Africans in Machine Learning, NeurIPS, Neural Information Processing Systems

arXiv:2402.01019 [pdf, other]

Domain-Independent Deception: A New Taxonomy and Linguistic Analysis

Authors: Rakesh M. Verma, Nachum Dershowitz, Victor Zeng, Dainis Boumber, Xuting Liu

Abstract: Internet-based economies and societies are drowning in deceptive attacks. These attacks take many forms, such as fake news, phishing, and job scams, which we call "domains of deception." Machine-learning and natural-language-processing researchers have been attempting to ameliorate this precarious situation by designing domain-specific detectors. Only a few recent works have considered domain-independent deception. We collect these disparate threads of research and investigate domain-independent deception. First, we provide a new computational definition of deception and break down deception into a new taxonomy. Then, we analyze the debate on linguistic cues for deception and supply guidelines for systematic reviews. Finally, we investigate common linguistic features and give evidence for knowledge transfer across different forms of deception.

Submitted 1 February, 2024; originally announced February 2024.

Comments: 33 pages. arXiv admin note: text overlap with arXiv:2207.01738

arXiv:2207.01738 [pdf, other]

Domain-Independent Deception: Definition, Taxonomy and the Linguistic Cues Debate

Authors: Rakesh M. Verma, Nachum Dershowitz, Victor Zeng, Xuting Liu

Abstract: Internet-based economies and societies are drowning in deceptive attacks. These attacks take many forms, such as fake news, phishing, and job scams, which we call "domains of deception." Machine-learning and natural-language-processing researchers have been attempting to ameliorate this precarious situation by designing domain-specific detectors. Only a few recent works have considered domain-independent deception. We collect these disparate threads of research and investigate domain-independent deception along four dimensions. First, we provide a new computational definition of deception and formalize it using probability theory. Second, we break down deception into a new taxonomy. Third, we analyze the debate on linguistic cues for deception and supply guidelines for systematic reviews. Fourth, we provide some evidence and some suggestions for domain-independent deception detection.

Submitted 4 July, 2022; originally announced July 2022.

Comments: 16 pages, 2 figures

ACM Class: K.6.5

arXiv:2103.08001 [pdf, other]

Claim Verification using a Multi-GAN based Model

Authors: Amartya Hatua, Arjun Mukherjee, Rakesh M. Verma

Abstract: This article describes research on claim verification carried out using a multiple GAN-based model. The proposed model consists of three pairs of generators and discriminators. The generator and discriminator pairs are responsible for generating synthetic data for supported and refuted claims and claim labels. A theoretical discussion about the proposed model is provided to validate the equilibrium state of the model. The proposed model is applied to the FEVER dataset, and a pre-trained language model is used for the input text data. The synthetically generated data helps to gain information which helps the model to perform better than state of the art models and other standard classifiers.

Submitted 20 July, 2021; v1 submitted 14 March, 2021; originally announced March 2021.

Comments: Paper is submitted at LDK 2021 3rd Conference on Language, Data and Knowledge

MSC Class: 68T50

arXiv:2007.07403 [pdf, other]

Modeling Coherency in Generated Emails by Leveraging Deep Neural Learners

Authors: Avisha Das, Rakesh M. Verma

Abstract: Advanced machine learning and natural language techniques enable attackers to launch sophisticated and targeted social engineering-based attacks. To counter the active attacker issue, researchers have since resorted to proactive methods of detection. Email masquerading using targeted emails to fool the victim is an advanced attack method. However automatic text generation requires controlling the context and coherency of the generated content, which has been identified as an increasingly difficult problem. The method used leverages a hierarchical deep neural model which uses a learned representation of the sentences in the input document to generate structured written emails. We demonstrate the generation of short and targeted text messages using the deep model. The global coherency of the synthesized text is evaluated using a qualitative study as well as multiple quantitative measures.

Submitted 14 July, 2020; originally announced July 2020.

Comments: Accepted for Publication at Computación y Sistemas (CyS); Poster at CiCLing 2019 and WiML@ICML 2020

arXiv:2006.13499 [pdf, other]

Less is More: Exploiting Social Trust to Increase the Effectiveness of a Deception Attack

Authors: Shahryar Baki, Rakesh M. Verma, Arjun Mukherjee, Omprakash Gnawali

Abstract: Cyber attacks such as phishing, IRS scams, etc., still are successful in fooling Internet users. Users are the last line of defense against these attacks since attackers seem to always find a way to bypass security systems. Understanding how users reason about scams and frauds can help security providers to improve users' security hygiene practices. In this work, we study the users' reasoning and the effectiveness of several variables within the context of the company representative fraud. Some of the variables that we study are: 1) the effect of using LinkedIn as a medium for delivering the phishing message instead of using email, 2) the effectiveness of natural language generation techniques in generating phishing emails, and 3) how some simple customizations, e.g., adding sender's contact info to the email, affect participants' perception. The results obtained from the within-subject study show that participants are not prepared even for a well-known attack - company representative fraud. Findings include: approximately 65% mean detection rate and insights into how the success rate changes with the facade and correspondent (sender/receiver) information. A significant finding is that a smaller set of well-chosen strategies is better than a large 'mess' of strategies. We also find significant differences in how males and females approach the same company representative fraud. Insights from our work could help defenders in developing better strategies to evaluate their defenses and in devising better training strategies.

Submitted 24 June, 2020; originally announced June 2020.

Comments: 15 pages, 6 figures

ACM Class: H.5.m; I.2.7; J.4

arXiv:cs/0010034 [pdf, ps, other]

Static Analysis Techniques for Equational Logic Programming

Authors: Rakesh M. Verma

Abstract: An equational logic program is a set of directed equations or rules, which are used to compute in the obvious way (by replacing equals with "simpler" equals). We present static analysis techniques for efficient equational logic programming, some of which have been implemented in $LR^2$, a laboratory for developing and evaluating fast, efficient, and practical rewriting techniques. Two novel features of $LR^2$ are that non-left-linear rules are allowed in most contexts and it has a tabling option based on the congruence-closure based algorithm to compute normal forms. Although the focus of this research is on the tabling approach, some of the techniques are applicable to the untabled approach as well. Our presentation is in the context of $LR^2$, which is an interpreter, but some of the techniques apply to compilation as well.

Submitted 27 October, 2000; originally announced October 2000.

Comments: Appeared in 1st ACM SIGPLAN Workshop on Rule-based Programming (RULE 2000)

ACM Class: F.3.2; D.3.2

Showing 1–9 of 9 results for author: Verma, R M