Search | arXiv e-print repository

Taxonomy-Guided Zero-Shot Recommendations with LLMs

Authors: Yueqing Liang, Liangwei Yang, Chen Wang, Xiongxiao Xu, Philip S. Yu, Kai Shu

Abstract: With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To a… ▽ More With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To address these issues, we propose a novel method using a taxonomy dictionary. This method provides a systematic framework for categorizing and organizing items, improving the clarity and structure of item information. By incorporating the taxonomy dictionary into LLM prompts, we achieve efficient token utilization and controlled feature generation, leading to more accurate and contextually relevant recommendations. Our Taxonomy-guided Recommendation (TaxRec) approach features a two-step process: one-time taxonomy categorization and LLM-based recommendation, enabling zero-shot recommendations without the need for domain-specific fine-tuning. Experimental results demonstrate TaxRec significantly enhances recommendation quality compared to traditional zero-shot approaches, showcasing its efficacy as personal recommender with LLMs. Code is available at https://github.com/yueqingliang1/TaxRec. △ Less

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2405.18776 [pdf, other]

LMO-DP: Optimizing the Randomization Mechanism for Differentially Private Fine-Tuning (Large) Language Models

Authors: Qin Yang, Meisam Mohammad, Han Wang, Ali Payani, Ashish Kundu, Kai Shu, Yan Yan, Yuan Hong

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., the privacy budget $ε< 3$). To address such limitations… ▽ More Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., the privacy budget $ε< 3$). To address such limitations, we propose a novel Language Model-based Optimal Differential Privacy (LMO-DP) mechanism, which takes the first step to enable the tight composition of accurately fine-tuning (large) language models with a sub-optimal DP mechanism, even in strong privacy regimes (e.g., $0.1\leq ε<3$). Furthermore, we propose a novel offline optimal noise search method to efficiently derive the sub-optimal DP that significantly reduces the noise magnitude. For instance, fine-tuning RoBERTa-large (with 300M parameters) on the SST-2 dataset can achieve an accuracy of 92.20% (given $ε=0.3$, $δ=10^{-10}$) by drastically outperforming the Gaussian mechanism (e.g., $\sim 50\%$ for small $ε$ and $δ$). We also draw similar findings on the text generation tasks on GPT-2. Finally, to our best knowledge, LMO-DP is also the first solution to accurately fine-tune Llama-2 with strong differential privacy guarantees. The code will be released soon and available upon request. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: 18 pages, 15 figures

arXiv:2404.14757 [pdf, other]

Integrating Mamba and Transformer for Long-Short Range Time Series Forecasting

Authors: Xiongxiao Xu, Yueqing Liang, Baixiang Huang, Zhiling Lan, Kai Shu

Abstract: Time series forecasting is an important problem and plays a key role in a variety of applications including weather forecasting, stock market, and scientific simulations. Although transformers have proven to be effective in capturing dependency, its quadratic complexity of attention mechanism prevents its further adoption in long-range time series forecasting, thus limiting them attend to short-ra… ▽ More Time series forecasting is an important problem and plays a key role in a variety of applications including weather forecasting, stock market, and scientific simulations. Although transformers have proven to be effective in capturing dependency, its quadratic complexity of attention mechanism prevents its further adoption in long-range time series forecasting, thus limiting them attend to short-range range. Recent progress on state space models (SSMs) have shown impressive performance on modeling long range dependency due to their subquadratic complexity. Mamba, as a representative SSM, enjoys linear time complexity and has achieved strong scalability on tasks that requires scaling to long sequences, such as language, audio, and genomics. In this paper, we propose to leverage a hybrid framework Mambaformer that internally combines Mamba for long-range dependency, and Transformer for short range dependency, for long-short range forecasting. To the best of our knowledge, this is the first paper to combine Mamba and Transformer architecture in time series data. We investigate possible hybrid architectures to combine Mamba layer and attention layer for long-short range time series forecasting. The comparative study shows that the Mambaformer family can outperform Mamba and Transformer in long-short range time series forecasting problem. The code is available at https://github.com/XiongxiaoXu/Mambaformerin-Time-Series. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2403.09747 [pdf, other]

Re-Search for The Truth: Multi-round Retrieval-augmented Large Language Models are Strong Fake News Detectors

Authors: Guanghua Li, Wensheng Lu, Wei Zhang, Defu Lian, Kezhong Lu, Rui Mao, Kai Shu, Hao Liao

Abstract: The proliferation of fake news has had far-reaching implications on politics, the economy, and society at large. While Fake news detection methods have been employed to mitigate this issue, they primarily depend on two essential elements: the quality and relevance of the evidence, and the effectiveness of the verdict prediction mechanism. Traditional methods, which often source information from st… ▽ More The proliferation of fake news has had far-reaching implications on politics, the economy, and society at large. While Fake news detection methods have been employed to mitigate this issue, they primarily depend on two essential elements: the quality and relevance of the evidence, and the effectiveness of the verdict prediction mechanism. Traditional methods, which often source information from static repositories like Wikipedia, are limited by outdated or incomplete data, particularly for emerging or rare claims. Large Language Models (LLMs), known for their remarkable reasoning and generative capabilities, introduce a new frontier for fake news detection. However, like traditional methods, LLM-based solutions also grapple with the limitations of stale and long-tail knowledge. Additionally, retrieval-enhanced LLMs frequently struggle with issues such as low-quality evidence retrieval and context length constraints. To address these challenges, we introduce a novel, retrieval-augmented LLMs framework--the first of its kind to automatically and strategically extract key evidence from web sources for claim verification. Employing a multi-round retrieval strategy, our framework ensures the acquisition of sufficient, relevant evidence, thereby enhancing performance. Comprehensive experiments across three real-world datasets validate the framework's superiority over existing methods. Importantly, our model not only delivers accurate verdicts but also offers human-readable explanations to improve result interpretability. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.08213 [pdf, other]

Can Large Language Models Identify Authorship?

Authors: Baixiang Huang, Canyu Chen, Kai Shu

Abstract: The ability to accurately identify authorship is crucial for verifying content authenticity and mitigating misinformation. Large Language Models (LLMs) have demonstrated exceptional capacity for reasoning and problem-solving. However, their potential in authorship analysis, encompassing authorship verification and attribution, remains underexplored. This paper conducts a comprehensive evaluation o… ▽ More The ability to accurately identify authorship is crucial for verifying content authenticity and mitigating misinformation. Large Language Models (LLMs) have demonstrated exceptional capacity for reasoning and problem-solving. However, their potential in authorship analysis, encompassing authorship verification and attribution, remains underexplored. This paper conducts a comprehensive evaluation of LLMs in these critical tasks. Traditional studies have depended on hand-crafted stylistic features, whereas state-of-the-art approaches leverage text embeddings from pre-trained language models. These methods, which typically require fine-tuning on labeled data, often suffer from performance degradation in cross-domain applications and provide limited explainability. This work seeks to address three research questions: (1) Can LLMs perform zero-shot, end-to-end authorship verification effectively? (2) Are LLMs capable of accurately attributing authorship among multiple candidates authors (e.g., 10 and 20)? (3) How can LLMs provide explainability in authorship analysis, particularly through the role of linguistic features? Moreover, we investigate the integration of explicit linguistic features to guide LLMs in their reasoning processes. Our extensive assessment demonstrates LLMs' proficiency in both tasks without the need for domain-specific fine-tuning, providing insights into their decision-making via a detailed analysis of linguistic features. This establishes a new benchmark for future research on LLM-based authorship analysis. The code and data are available at https://github.com/baixianghuang/authorship-llm. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: 10 pages, 7 figures

arXiv:2402.04559 [pdf, other]

Can Large Language Model Agents Simulate Human Trust Behaviors?

Authors: Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Kai Shu, Adel Bibi, Ziniu Hu, Philip Torr, Bernard Ghanem, Guohao Li

Abstract: Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in applications such as social science. However, one fundamental question remains: can LLM agents really simulate human behaviors? In this paper, we focus on one of the most critical behaviors in human interactions, trust, and aim to investigate whether or not LLM agents can simulate human trust be… ▽ More Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in applications such as social science. However, one fundamental question remains: can LLM agents really simulate human behaviors? In this paper, we focus on one of the most critical behaviors in human interactions, trust, and aim to investigate whether or not LLM agents can simulate human trust behaviors. We first find that LLM agents generally exhibit trust behaviors, referred to as agent trust, under the framework of Trust Games, which are widely recognized in behavioral economics. Then, we discover that LLM agents can have high behavioral alignment with humans regarding trust behaviors, particularly for GPT-4, indicating the feasibility to simulate human trust behaviors with LLM agents. In addition, we probe into the biases in agent trust and the differences in agent trust towards agents and humans. We also explore the intrinsic properties of agent trust under conditions including advanced reasoning strategies and external manipulations. We further offer important implications of our discoveries for various scenarios where trust is paramount. Our study provides new insights into the behaviors of LLM agents and the fundamental analogy between LLMs and humans. △ Less

Submitted 10 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: The first two authors contributed equally. Project website: https://www.camel-ai.org/research/agent-trust

arXiv:2401.15496 [pdf, other]

Baichuan2-Sum: Instruction Finetune Baichuan2-7B Model for Dialogue Summarization

Authors: Jianfei Xiao, Yancan Chen, Yimin Ou, Hanyi Yu, Kai Shu, Yiyong Xiao

Abstract: Large language models (LLMs) like Llama, Baichuan and Bloom models show remarkable ability with instruction fine-tuning in many natural language tasks. Nevertheless, for the dialogue summarization task, which aims to generate summaries for different roles in dialogue, most of the state-of-the-art methods conduct on small models (e.g Bart and Bert). Existing methods try to add task specified optimi… ▽ More Large language models (LLMs) like Llama, Baichuan and Bloom models show remarkable ability with instruction fine-tuning in many natural language tasks. Nevertheless, for the dialogue summarization task, which aims to generate summaries for different roles in dialogue, most of the state-of-the-art methods conduct on small models (e.g Bart and Bert). Existing methods try to add task specified optimization on small models like adding global-local centrality score to models. In this paper, we propose an instruction fine-tuning model: Baichuan2-Sum, for role-oriented diaglouge summarization. By setting different instructions for different roles, the model can learn from the dialogue interactions and output the expected summaries. Furthermore, we applied NEFTune technique to add suitable noise during training to improve the results. The experiments demonstrate that the proposed model achieves the new state-of-the-art results on two public dialogue summarization datasets: CSDS and SAMSUM. We release our model and related codes to facilitate future studies on dialogue summarization task. △ Less

Submitted 3 April, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

arXiv:2401.09090 [pdf, other]

Understanding the concerns and choices of public when using large language models for healthcare

Authors: Yunpeng Xiao, Kyrie Zhixuan Zhou, Yueqing Liang, Kai Shu

Abstract: Large language models (LLMs) have shown their potential in biomedical fields. However, how the public uses them for healthcare purposes such as medical Q\&A, self-diagnosis, and daily healthcare information seeking is under-investigated. In this paper, we adopt a mixed-methods approach, including surveys (N=167) and interviews (N=17) to investigate how and why the public uses LLMs for healthcare.… ▽ More Large language models (LLMs) have shown their potential in biomedical fields. However, how the public uses them for healthcare purposes such as medical Q\&A, self-diagnosis, and daily healthcare information seeking is under-investigated. In this paper, we adopt a mixed-methods approach, including surveys (N=167) and interviews (N=17) to investigate how and why the public uses LLMs for healthcare. LLMs as a healthcare tool have gained popularity, and are often used in combination with other information channels such as search engines and online health communities to optimize information quality. LLMs provide more accurate information and a more convenient interaction/service model compared to traditional channels. LLMs also do a better job of reducing misinformation, especially in daily healthcare questions. Doctors using LLMs for diagnosis is less acceptable than for auxiliary work such as writing medical records. Based on the findings, we reflect on the ethical and effective use of LLMs for healthcare and propose future research directions. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: 22 pages

ACM Class: J.4; K.4.2

arXiv:2401.05561 [pdf, other]

TrustLLM: Trustworthiness in Large Language Models

Authors: Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang , et al. (45 additional authors not shown)

Abstract: Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in… ▽ More Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness. △ Less

Submitted 17 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: This work is still under work and we welcome your contribution

arXiv:2311.11473

CSGNN: Conquering Noisy Node labels via Dynamic Class-wise Selection

Authors: Yifan Li, Zhen Tan, Kai Shu, Zongsheng Cao, Yu Kong, Huan Liu

Abstract: Graph Neural Networks (GNNs) have emerged as a powerful tool for representation learning on graphs, but they often suffer from overfitting and label noise issues, especially when the data is scarce or imbalanced. Different from the paradigm of previous methods that rely on single-node confidence, in this paper, we introduce a novel Class-wise Selection for Graph Neural Networks, dubbed CSGNN, whic… ▽ More Graph Neural Networks (GNNs) have emerged as a powerful tool for representation learning on graphs, but they often suffer from overfitting and label noise issues, especially when the data is scarce or imbalanced. Different from the paradigm of previous methods that rely on single-node confidence, in this paper, we introduce a novel Class-wise Selection for Graph Neural Networks, dubbed CSGNN, which employs a neighbor-aggregated latent space to adaptively select reliable nodes across different classes. Specifically, 1) to tackle the class imbalance issue, we introduce a dynamic class-wise selection mechanism, leveraging the clustering technique to identify clean nodes based on the neighbor-aggregated confidences. In this way, our approach can avoid the pitfalls of biased sampling which is common with global threshold techniques. 2) To alleviate the problem of noisy labels, built on the concept of the memorization effect, CSGNN prioritizes learning from clean nodes before noisy ones, thereby iteratively enhancing model performance while mitigating label noise. Through extensive experiments, we demonstrate that CSGNN outperforms state-of-the-art methods in terms of both effectiveness and robustness. △ Less

Submitted 14 December, 2023; v1 submitted 19 November, 2023; originally announced November 2023.

Comments: For the privacy issue

arXiv:2311.09433 [pdf, other]

Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment

Authors: Haoran Wang, Kai Shu

Abstract: To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potent… ▽ More To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. Inspired by recent success in modifying model behavior through steering vectors without the need for optimization, and drawing on its effectiveness in red-teaming LLMs, we conducted experiments employing activation steering to target four key aspects of LLMs: truthfulness, toxicity, bias, and harmfulness - across a varied set of attack settings. To establish a universal attack strategy applicable to diverse target alignments without depending on manual analysis, we automatically select the intervention layer based on contrastive layer search. Our experiment results show that activation attacks are highly effective and add little or no overhead to attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks. Our code and data are available at https://github.com/wang2226/Backdoor-Activation-Attack Warning: this paper contains content that can be offensive or upsetting. △ Less

Submitted 24 November, 2023; v1 submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.09428 [pdf, other]

Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models

Authors: Yueqing Liang, Lu Cheng, Ali Payani, Kai Shu

Abstract: This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. We propose a simple yet effective framework FABLE that leverages backdoor attacks as they al… ▽ More This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. We propose a simple yet effective framework FABLE that leverages backdoor attacks as they allow targeted control over the fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., "non-abusive") and flip their labels to the unfavored outcome, i.e., "abusive". Experiments on benchmark datasets demonstrate the effectiveness of FABLE attacking fairness and utility in abusive language detection. △ Less

Submitted 5 December, 2023; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: Under review

arXiv:2311.05656 [pdf, other]

Combating Misinformation in the Age of LLMs: Opportunities and Challenges

Authors: Canyu Chen, Kai Shu

Abstract: Misinformation such as fake news and rumors is a serious threat on information ecosystems and public trust. The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation. Generally, LLMs can be a double-edged sword in the fight. On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowl… ▽ More Misinformation such as fake news and rumors is a serious threat on information ecosystems and public trust. The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation. Generally, LLMs can be a double-edged sword in the fight. On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowledge and strong reasoning abilities. Thus, one emergent question is: how to utilize LLMs to combat misinformation? On the other hand, the critical challenge is that LLMs can be easily leveraged to generate deceptive misinformation at scale. Then, another important question is: how to combat LLM-generated misinformation? In this paper, we first systematically review the history of combating misinformation before the advent of LLMs. Then we illustrate the current efforts and present an outlook for these two fundamental questions respectively. The goal of this survey paper is to facilitate the progress of utilizing LLMs for fighting misinformation and call for interdisciplinary efforts from different stakeholders for combating LLM-generated misinformation. △ Less

Submitted 8 November, 2023; originally announced November 2023.

Comments: 9 pages for the main paper, 35 pages including 656 references, more resources on "LLMs Meet Misinformation" are on the website: https://llm-misinformation.github.io/

arXiv:2310.10429 [pdf, other]

Exploiting User Comments for Early Detection of Fake News Prior to Users' Commenting

Authors: Qiong Nan, Qiang Sheng, Juan Cao, Yongchun Zhu, Danding Wang, Guang Yang, **tao Li, Kai Shu

Abstract: Both accuracy and timeliness are key factors in detecting fake news on social media. However, most existing methods encounter an accuracy-timeliness dilemma: Content-only methods guarantee timeliness but perform moderately because of limited available information, while social context-based ones generally perform better but inevitably lead to latency because of social context accumulation needs. T… ▽ More Both accuracy and timeliness are key factors in detecting fake news on social media. However, most existing methods encounter an accuracy-timeliness dilemma: Content-only methods guarantee timeliness but perform moderately because of limited available information, while social context-based ones generally perform better but inevitably lead to latency because of social context accumulation needs. To break such a dilemma, a feasible but not well-studied solution is to leverage social contexts (e.g., comments) from historical news for training a detection model and apply it to newly emerging news without social contexts. This requires the model to (1) sufficiently learn helpful knowledge from social contexts, and (2) be well compatible with situations that social contexts are available or not. To achieve this goal, we propose to absorb and parameterize useful knowledge from comments in historical news and then inject it into a content-only detection model. Specifically, we design the Comments Assisted Fake News Detection method (CAS-FEND), which transfers useful knowledge from a comments-aware teacher model to a content-only student model during training. The student model is further used to detect newly emerging fake news. Experiments show that the CAS-FEND student model outperforms all content-only methods and even those with 1/4 comments as inputs, demonstrating its superiority for early detection. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 14 pages, 8 figures, 8 tables

arXiv:2310.05253 [pdf, other]

Explainable Claim Verification via Knowledge-Grounded Reasoning with Large Language Models

Authors: Haoran Wang, Kai Shu

Abstract: Claim verification plays a crucial role in combating misinformation. While existing works on claim verification have shown promising results, a crucial piece of the puzzle that remains unsolved is to understand how to verify claims without relying on human-annotated data, which is expensive to create at a large scale. Additionally, it is important for models to provide comprehensive explanations t… ▽ More Claim verification plays a crucial role in combating misinformation. While existing works on claim verification have shown promising results, a crucial piece of the puzzle that remains unsolved is to understand how to verify claims without relying on human-annotated data, which is expensive to create at a large scale. Additionally, it is important for models to provide comprehensive explanations that can justify their decisions and assist human fact-checkers. This paper presents First-Order-Logic-Guided Knowledge-Grounded (FOLK) Reasoning that can verify complex claims and generate explanations without the need for annotated evidence using Large Language Models (LLMs). FOLK leverages the in-context learning ability of LLMs to translate the claim into a First-Order-Logic (FOL) clause consisting of predicates, each corresponding to a sub-claim that needs to be verified. Then, FOLK performs FOL-Guided reasoning over a set of knowledge-grounded question-and-answer pairs to make veracity predictions and generate explanations to justify its decision-making process. This process makes our model highly explanatory, providing clear explanations of its reasoning process in human-readable form. Our experiment results indicate that FOLK outperforms strong baselines on three datasets encompassing various claim verification challenges. Our code and data are available. △ Less

Submitted 19 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: Findings of EMNLP 2023

arXiv:2309.13788 [pdf, other]

Can LLM-Generated Misinformation Be Detected?

Authors: Canyu Chen, Kai Shu

Abstract: The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the pe… ▽ More The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating misinformation with LLMs. Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery on combating misinformation in the age of LLMs and the countermeasures. △ Less

Submitted 23 April, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

Comments: Accepted to Proceedings of ICLR 2024. 9 pages for main paper, 40 pages including appendix. The code, results, dataset for this paper and more resources on "LLMs Meet Misinformation" have been released on the project website: https://llm-misinformation.github.io/

arXiv:2309.12363 [pdf, other]

Investigating Online Financial Misinformation and Its Consequences: A Computational Perspective

Authors: Aman Rangapur, Haoran Wang, Kai Shu

Abstract: The rapid dissemination of information through digital platforms has revolutionized the way we access and consume news and information, particularly in the realm of finance. However, this digital age has also given rise to an alarming proliferation of financial misinformation, which can have detrimental effects on individuals, markets, and the overall economy. This research paper aims to provide a… ▽ More The rapid dissemination of information through digital platforms has revolutionized the way we access and consume news and information, particularly in the realm of finance. However, this digital age has also given rise to an alarming proliferation of financial misinformation, which can have detrimental effects on individuals, markets, and the overall economy. This research paper aims to provide a comprehensive survey of online financial misinformation, including its types, sources, and impacts. We first discuss the characteristics and manifestations of financial misinformation, encompassing false claims and misleading content. We explore various case studies that illustrate the detrimental consequences of financial misinformation on the economy. Finally, we highlight the potential impact and implications of detecting financial misinformation. Early detection and mitigation strategies can help protect investors, enhance market transparency, and preserve financial stability. We emphasize the importance of greater awareness, education, and regulation to address the issue of online financial misinformation and safeguard individuals and businesses from its harmful effects. In conclusion, this research paper sheds light on the pervasive issue of online financial misinformation and its wide-ranging consequences. By understanding the types, sources, and impacts of misinformation, stakeholders can work towards implementing effective detection and prevention measures to foster a more informed and resilient financial ecosystem. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: 32 pages, 2 figures

ACM Class: A.1; I.m

arXiv:2309.08793 [pdf, other]

Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation

Authors: Aman Rangapur, Haoran Wang, Ling Jian, Kai Shu

Abstract: Fact-checking in financial domain is under explored, and there is a shortage of quality dataset in this domain. In this paper, we propose Fin-Fact, a benchmark dataset for multimodal fact-checking within the financial domain. Notably, it includes professional fact-checker annotations and justifications, providing expertise and credibility. With its multimodal nature encompassing both textual and v… ▽ More Fact-checking in financial domain is under explored, and there is a shortage of quality dataset in this domain. In this paper, we propose Fin-Fact, a benchmark dataset for multimodal fact-checking within the financial domain. Notably, it includes professional fact-checker annotations and justifications, providing expertise and credibility. With its multimodal nature encompassing both textual and visual content, Fin-Fact provides complementary information sources to enhance factuality analysis. Its primary objective is combating misinformation in finance, fostering transparency, and building trust in financial reporting and news dissemination. By offering insightful explanations, Fin-Fact empowers users, including domain experts and end-users, to understand the reasoning behind fact-checking decisions, validating claim credibility, and fostering trust in the fact-checking process. The Fin-Fact dataset, along with our experimental codes is available at https://github.com/IIT-DM/Fin-Fact/. △ Less

Submitted 1 May, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: 8 pages, 4 figures, 4 tables

ACM Class: I.2; E.m

arXiv:2306.15231 [pdf, other]

Emulating Reader Behaviors for Fake News Detection

Authors: Junwei Yin, Min Gao, Kai Shu, Zehua Zhao, Yinqiu Huang, Jia Wang

Abstract: The wide dissemination of fake news has affected our lives in many aspects, making fake news detection important and attracting increasing attention. Existing approaches make substantial contributions in this field by modeling news from a single-modal or multi-modal perspective. However, these modal-based methods can result in sub-optimal outcomes as they ignore reader behaviors in news consumptio… ▽ More The wide dissemination of fake news has affected our lives in many aspects, making fake news detection important and attracting increasing attention. Existing approaches make substantial contributions in this field by modeling news from a single-modal or multi-modal perspective. However, these modal-based methods can result in sub-optimal outcomes as they ignore reader behaviors in news consumption and authenticity verification. For instance, they haven't taken into consideration the component-by-component reading process: from the headline, images, comments, to the body, which is essential for modeling news with more granularity. To this end, we propose an approach of Emulating the behaviors of readers (Ember) for fake news detection on social media, incorporating readers' reading and verificating process to model news from the component perspective thoroughly. Specifically, we first construct intra-component feature extractors to emulate the behaviors of semantic analyzing on each component. Then, we design a module that comprises inter-component feature extractors and a sequence-based aggregator. This module mimics the process of verifying the correlation between components and the overall reading and verification sequence. Thus, Ember can handle the news with various components by emulating corresponding sequences. We conduct extensive experiments on nine real-world datasets, and the results demonstrate the superiority of Ember. △ Less

Submitted 27 June, 2023; originally announced June 2023.

Comments: 12 pages

arXiv:2306.13450 [pdf, other]

doi 10.1145/3580305.3599873

MUSER: A MUlti-Step Evidence Retrieval Enhancement Framework for Fake News Detection

Authors: Hao Liao, Jiaohao Peng, Zhanyi Huang, Wei Zhang, Guanghua Li, Kai Shu, Xing Xie

Abstract: The ease of spreading false information online enables individuals with malicious intent to manipulate public opinion and destabilize social stability. Recently, fake news detection based on evidence retrieval has gained popularity in an effort to identify fake news reliably and reduce its impact. Evidence retrieval-based methods can improve the reliability of fake news detection by computing the… ▽ More The ease of spreading false information online enables individuals with malicious intent to manipulate public opinion and destabilize social stability. Recently, fake news detection based on evidence retrieval has gained popularity in an effort to identify fake news reliably and reduce its impact. Evidence retrieval-based methods can improve the reliability of fake news detection by computing the textual consistency between the evidence and the claim in the news. In this paper, we propose a framework for fake news detection based on MUlti-Step Evidence Retrieval enhancement (MUSER), which simulates the steps of human beings in the process of reading news, summarizing, consulting materials, and inferring whether the news is true or fake. Our model can explicitly model dependencies among multiple pieces of evidence, and perform multi-step associations for the evidence required for news verification through multi-step retrieval. In addition, our model is able to automatically collect existing evidence through paragraph retrieval and key evidence selection, which can save the tedious process of manual evidence collection. We conducted extensive experiments on real-world datasets in different languages, and the results demonstrate that our proposed model outperforms state-of-the-art baseline methods for detecting fake news by at least 3% in F1-Macro and 4% in F1-Micro. Furthermore, it provides interpretable evidence for end users. △ Less

Submitted 23 June, 2023; originally announced June 2023.

Comments: 12 pages, 5 figures, accepted by KDD '23, ADS track

Journal ref: KDD 2023

arXiv:2306.08256 [pdf, other]

Data Augmentation for Seizure Prediction with Generative Diffusion Model

Authors: Kai Shu, Yuchang Zhao, Le Wu, Ai** Liu, Ruobing Qian, Xun Chen

Abstract: Objective: Seizure prediction is of great importance to improve the life of patients. The focal point is to distinguish preictal states from interictal ones. With the development of machine learning, seizure prediction methods have achieved significant progress. However, the severe imbalance problem between preictal and interictal data still poses a great challenge, restricting the performance of… ▽ More Objective: Seizure prediction is of great importance to improve the life of patients. The focal point is to distinguish preictal states from interictal ones. With the development of machine learning, seizure prediction methods have achieved significant progress. However, the severe imbalance problem between preictal and interictal data still poses a great challenge, restricting the performance of classifiers. Data augmentation is an intuitive way to solve this problem. Existing data augmentation methods generate samples by overlap** or recombining data. The distribution of generated samples is limited by original data, because such transformations cannot fully explore the feature space and offer new information. As the epileptic EEG representation varies among seizures, these generated samples cannot provide enough diversity to achieve high performance on a new seizure. As a consequence, we propose a novel data augmentation method with diffusion model called DiffEEG. Methods: Diffusion models are a class of generative models that consist of two processes. Specifically, in the diffusion process, the model adds noise to the input EEG sample step by step and converts the noisy sample into output random noise, exploring the distribution of data by minimizing the loss between the output and the noise added. In the denoised process, the model samples the synthetic data by removing the noise gradually, diffusing the data distribution to outward areas and narrowing the distance between different clusters. Results: We compared DiffEEG with existing methods, and integrated them into three representative classifiers. The experiments indicate that DiffEEG could further improve the performance and shows superiority to existing methods. Conclusion: This paper proposes a novel and effective method to solve the imbalanced problem and demonstrates the effectiveness and generality of our method. △ Less

Submitted 14 June, 2023; originally announced June 2023.

Comments: 12 pages, 6 figures

arXiv:2305.19552 [pdf, other]

Investigating Gender Euphoria and Dysphoria on TikTok: Characterization and Comparison

Authors: SJ Dillon, Yueqing Liang, H. Russell Bernard, Kai Shu

Abstract: With the emergence of short video-sharing platforms, engagement with social media sites devoted to opinion and knowledge dissemination has rapidly increased. Among the short video platforms, TikTok is one of the most popular globally and has become the platform of choice for transgender and nonbinary individuals, who have formed a large community to mobilize personal experience and exchange inform… ▽ More With the emergence of short video-sharing platforms, engagement with social media sites devoted to opinion and knowledge dissemination has rapidly increased. Among the short video platforms, TikTok is one of the most popular globally and has become the platform of choice for transgender and nonbinary individuals, who have formed a large community to mobilize personal experience and exchange information. The knowledge produced in online spaces can influence the ways in which people understand and experience their own gender and transitions, as they hear about others and weigh that experiential and medical knowledge against their own. This paper extends current research and past interview methods on gender euphoria and gender dysphoria to analyze what and how online communities on TikTok discuss these two types of gender experiences. Our findings indicate that gender euphoria and gender dysphoria are differently described in online TikTok spaces. These findings indicate that there are wide similarities in the words used to describe gender dysphoria as well as gender euphoria in both the comments of videos and content creators' hashtags. Finally, our results show that gender euphoria is described in more similar terms between transfeminine and transmasculine experiences than gender dysphoria, which appears to be more differentiated by gendering experience and transition goals. We hope this paper can provide insights for future research on understanding transgender and nonbinary individuals in online communities. △ Less

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.10668 [pdf, other]

MetaGAD: Learning to Meta Transfer for Few-shot Graph Anomaly Detection

Authors: Xiongxiao Xu, Kaize Ding, Canyu Chen, Kai Shu

Abstract: Graph anomaly detection has long been an important problem in various domains pertaining to information security such as financial fraud, social spam, network intrusion, etc. The majority of existing methods are performed in an unsupervised manner, as labeled anomalies in a large scale are often too expensive to acquire. However, the identified anomalies may turn out to be data noises or uninteres… ▽ More Graph anomaly detection has long been an important problem in various domains pertaining to information security such as financial fraud, social spam, network intrusion, etc. The majority of existing methods are performed in an unsupervised manner, as labeled anomalies in a large scale are often too expensive to acquire. However, the identified anomalies may turn out to be data noises or uninteresting data instances due to the lack of prior knowledge on the anomalies. In realistic scenarios, it is often feasible to obtain limited labeled anomalies, which have great potential to advance graph anomaly detection. However, the work exploring limited labeled anomalies and a large amount of unlabeled nodes in graphs to detect anomalies is rather limited. Therefore, in this paper, we study a novel problem of few-shot graph anomaly detection. We propose a new framework MetaGAD to learn to meta-transfer the knowledge between unlabeled and labeled nodes for graph anomaly detection. Experimental results on six real-world datasets with synthetic anomalies and "organic" anomalies (available in the dataset) demonstrate the effectiveness of the proposed approach in detecting anomalies with limited labeled anomalies. △ Less

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2304.08640 [pdf, other]

TAP: A Comprehensive Data Repository for Traffic Accident Prediction in Road Networks

Authors: Baixiang Huang, Bryan Hooi, Kai Shu

Abstract: Road safety is a major global public health concern. Effective traffic crash prediction can play a critical role in reducing road traffic accidents. However, Existing machine learning approaches tend to focus on predicting traffic accidents in isolation, without considering the potential relationships between different accident locations within road networks. To incorporate graph structure informa… ▽ More Road safety is a major global public health concern. Effective traffic crash prediction can play a critical role in reducing road traffic accidents. However, Existing machine learning approaches tend to focus on predicting traffic accidents in isolation, without considering the potential relationships between different accident locations within road networks. To incorporate graph structure information, graph-based approaches such as Graph Neural Networks (GNNs) can be naturally applied. However, applying GNNs to the accident prediction problem faces challenges due to the lack of suitable graph-structured traffic accident datasets. To bridge this gap, we have constructed a real-world graph-based Traffic Accident Prediction (TAP) data repository, along with two representative tasks: accident occurrence prediction and accident severity prediction. With nationwide coverage, real-world network topology, and rich geospatial features, this data repository can be used for a variety of traffic-related tasks. We further comprehensively evaluate eleven state-of-the-art GNN variants and two non-graph-based machine learning methods using the created datasets. Significantly facilitated by the proposed data, we develop a novel Traffic Accident Vulnerability Estimation via Linkage (TRAVEL) model, which is designed to capture angular and directional information from road networks. We demonstrate that the proposed model consistently outperforms the baselines. The data and code are available on GitHub (https://github.com/baixianghuang/travel). △ Less

Submitted 17 April, 2023; originally announced April 2023.

Comments: 10 pages, 5 figures

arXiv:2302.07363 [pdf, other]

Attacking Fake News Detectors via Manipulating News Social Engagement

Authors: Haoran Wang, Yingtong Dou, Canyu Chen, Lichao Sun, Philip S. Yu, Kai Shu

Abstract: Social media is one of the main sources for news consumption, especially among the younger generation. With the increasing popularity of news consumption on various social media platforms, there has been a surge of misinformation which includes false information or unfounded claims. As various text- and social context-based fake news detectors are proposed to detect misinformation on social media,… ▽ More Social media is one of the main sources for news consumption, especially among the younger generation. With the increasing popularity of news consumption on various social media platforms, there has been a surge of misinformation which includes false information or unfounded claims. As various text- and social context-based fake news detectors are proposed to detect misinformation on social media, recent works start to focus on the vulnerabilities of fake news detectors. In this paper, we present the first adversarial attack framework against Graph Neural Network (GNN)-based fake news detectors to probe their robustness. Specifically, we leverage a multi-agent reinforcement learning (MARL) framework to simulate the adversarial behavior of fraudsters on social media. Research has shown that in real-world settings, fraudsters coordinate with each other to share different news in order to evade the detection of fake news detectors. Therefore, we modeled our MARL framework as a Markov Game with bot, cyborg, and crowd worker agents, which have their own distinctive cost, budget, and influence. We then use deep Q-learning to search for the optimal policy that maximizes the rewards. Extensive experimental results on two real-world fake news propagation datasets demonstrate that our proposed framework can effectively sabotage the GNN-based fake news detector performance. We hope this paper can provide insights for future research on fake news detection. △ Less

Submitted 27 April, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: ACM Web Conference 2023 (WWW'23)

arXiv:2212.12621 [pdf, other]

Nothing Stands Alone: Relational Fake News Detection with Hypergraph Neural Networks

Authors: Ujun Jeong, Kaize Ding, Lu Cheng, Ruocheng Guo, Kai Shu, Huan Liu

Abstract: Nowadays, fake news easily propagates through online social networks and becomes a grand threat to individuals and society. Assessing the authenticity of news is challenging due to its elaborately fabricated contents, making it difficult to obtain large-scale annotations for fake news data. Due to such data scarcity issues, detecting fake news tends to fail and overfit in the supervised setting. R… ▽ More Nowadays, fake news easily propagates through online social networks and becomes a grand threat to individuals and society. Assessing the authenticity of news is challenging due to its elaborately fabricated contents, making it difficult to obtain large-scale annotations for fake news data. Due to such data scarcity issues, detecting fake news tends to fail and overfit in the supervised setting. Recently, graph neural networks (GNNs) have been adopted to leverage the richer relational information among both labeled and unlabeled instances. Despite their promising results, they are inherently focused on pairwise relations between news, which can limit the expressive power for capturing fake news that spreads in a group-level. For example, detecting fake news can be more effective when we better understand relations between news pieces shared among susceptible users. To address those issues, we propose to leverage a hypergraph to represent group-wise interaction among news, while focusing on important news relations with its dual-level attention mechanism. Experiments based on two benchmark datasets show that our approach yields remarkable performance and maintains the high performance even with a small subset of labeled news data. △ Less

Submitted 23 December, 2022; originally announced December 2022.

Comments: Accepted in IEEE Big Data 22

arXiv:2211.05289 [pdf, ps, other]

Combating Health Misinformation in Social Media: Characterization, Detection, Intervention, and Open Issues

Authors: Canyu Chen, Haoran Wang, Matthew Shapiro, Yunyu Xiao, Fei Wang, Kai Shu

Abstract: Social media has been one of the main information consumption sources for the public, allowing people to seek and spread information more quickly and easily. However, the rise of various social media platforms also enables the proliferation of online misinformation. In particular, misinformation in the health domain has significant impacts on our society such as the COVID-19 infodemic. Therefore,… ▽ More Social media has been one of the main information consumption sources for the public, allowing people to seek and spread information more quickly and easily. However, the rise of various social media platforms also enables the proliferation of online misinformation. In particular, misinformation in the health domain has significant impacts on our society such as the COVID-19 infodemic. Therefore, health misinformation in social media has become an emerging research direction that attracts increasing attention from researchers of different disciplines. Compared to misinformation in other domains, the key differences of health misinformation include the potential of causing actual harm to humans' bodies and even lives, the hardness to identify for normal people, and the deep connection with medical science. In addition, health misinformation on social media has distinct characteristics from conventional channels such as television on multiple dimensions including the generation, dissemination, and consumption paradigms. Because of the uniqueness and importance of combating health misinformation in social media, we conduct this survey to further facilitate interdisciplinary research on this problem. In this survey, we present a comprehensive review of existing research about online health misinformation in different disciplines. Furthermore, we also systematically organize the related literature from three perspectives: characterization, detection, and intervention. Lastly, we conduct a deep discussion on the pressing open issues of combating health misinformation in social media and provide future directions for multidisciplinary researchers. △ Less

Submitted 9 November, 2022; originally announced November 2022.

Comments: 31 pages, 241 references

arXiv:2211.03721 [pdf, other]

Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

Authors: Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphanso, **yu Li, Yifan Gong

Abstract: Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State Transducers (WFST) have been employed to do ITN. WFSTs are nicely suited to this task but their size and run-time costs can make deployment on embe… ▽ More Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State Transducers (WFST) have been employed to do ITN. WFSTs are nicely suited to this task but their size and run-time costs can make deployment on embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight & accurate. At the core of our system is a streaming transformer tagger, that tags lexical tokens from ASR. The tag informs which ITN category might be applied, if at all. Following that, we apply an ITN-category-specific WFST, only on the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs equivalent to strong baselines, while being significantly smaller in size and retaining customization capabilities. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: 8 pages. 6 page paper 2 page references

arXiv:2207.08336 [pdf, other]

When Fairness Meets Privacy: Fair Classification with Semi-Private Sensitive Attributes

Authors: Canyu Chen, Yueqing Liang, Xiongxiao Xu, Shangyu Xie, Ashish Kundu, Ali Payani, Yuan Hong, Kai Shu

Abstract: Machine learning models have demonstrated promising performance in many areas. However, the concerns that they can be biased against specific demographic groups hinder their adoption in high-stake applications. Thus, it is essential to ensure fairness in machine learning models. Most previous efforts require direct access to sensitive attributes for mitigating bias. Nonetheless, it is often infeas… ▽ More Machine learning models have demonstrated promising performance in many areas. However, the concerns that they can be biased against specific demographic groups hinder their adoption in high-stake applications. Thus, it is essential to ensure fairness in machine learning models. Most previous efforts require direct access to sensitive attributes for mitigating bias. Nonetheless, it is often infeasible to obtain large-scale users' sensitive attributes considering users' concerns about privacy in the data collection process. Privacy mechanisms such as local differential privacy (LDP) are widely enforced on sensitive information in the data collection stage due to legal compliance and people's increasing awareness of privacy. Therefore, a critical problem is how to make fair predictions under privacy. We study a novel and practical problem of fair classification in a semi-private setting, where most of the sensitive attributes are private and only a small amount of clean ones are available. To this end, we propose a novel framework FairSP that can achieve Fair prediction under the Semi-Private setting. First, FairSP learns to correct the noise-protected sensitive attributes by exploiting the limited clean sensitive attributes. Then, it jointly models the corrected and clean data in an adversarial way for debiasing and prediction. Theoretical analysis shows that the proposed model can ensure fairness under mild assumptions in the semi-private setting. Extensive experimental results on real-world datasets demonstrate the effectiveness of our method for making fair predictions under privacy and maintaining high accuracy. △ Less

Submitted 29 May, 2023; v1 submitted 17 July, 2022; originally announced July 2022.

arXiv:2206.12808 [pdf, other]

doi 10.1109/TKDE.2022.3185151

Memory-Guided Multi-View Multi-Domain Fake News Detection

Authors: Yongchun Zhu, Qiang Sheng, Juan Cao, Qiong Nan, Kai Shu, Minghui Wu, **dong Wang, Fuzhen Zhuang

Abstract: The wide spread of fake news is increasingly threatening both individuals and society. Great efforts have been made for automatic fake news detection on a single domain (e.g., politics). However, correlations exist commonly across multiple news domains, and thus it is promising to simultaneously detect fake news of multiple domains. Based on our analysis, we pose two challenges in multi-domain fak… ▽ More The wide spread of fake news is increasingly threatening both individuals and society. Great efforts have been made for automatic fake news detection on a single domain (e.g., politics). However, correlations exist commonly across multiple news domains, and thus it is promising to simultaneously detect fake news of multiple domains. Based on our analysis, we pose two challenges in multi-domain fake news detection: 1) domain shift, caused by the discrepancy among domains in terms of words, emotions, styles, etc. 2) domain labeling incompleteness, stemming from the real-world categorization that only outputs one single domain label, regardless of topic diversity of a news piece. In this paper, we propose a Memory-guided Multi-view Multi-domain Fake News Detection Framework (M$^3$FEND) to address these two challenges. We model news pieces from a multi-view perspective, including semantics, emotion, and style. Specifically, we propose a Domain Memory Bank to enrich domain information which could discover potential domain labels based on seen news pieces and model domain characteristics. Then, with enriched domain information as input, a Domain Adapter could adaptively aggregate discriminative information from multiple views for news in various domains. Extensive offline experiments on English and Chinese datasets demonstrate the effectiveness of M$^3$FEND, and online tests verify its superiority in practice. Our code is available at https://github.com/ICTMCG/M3FEND. △ Less

Submitted 26 June, 2022; originally announced June 2022.

Comments: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)

arXiv:2206.10071 [pdf, other]

BOND: Benchmarking Unsupervised Outlier Node Detection on Static Attributed Graphs

Authors: Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang, Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, Lichao Sun, Jundong Li, George H. Chen, Zhihao Jia, Philip S. Yu

Abstract: Detecting which nodes in graphs are outliers is a relatively new machine learning task with numerous applications. Despite the proliferation of algorithms developed in recent years for this task, there has been no standard comprehensive setting for performance evaluation. Consequently, it has been difficult to understand which methods work well and when under a broad range of settings. To bridge t… ▽ More Detecting which nodes in graphs are outliers is a relatively new machine learning task with numerous applications. Despite the proliferation of algorithms developed in recent years for this task, there has been no standard comprehensive setting for performance evaluation. Consequently, it has been difficult to understand which methods work well and when under a broad range of settings. To bridge this gap, we present--to the best of our knowledge--the first comprehensive benchmark for unsupervised outlier node detection on static attributed graphs called BOND, with the following highlights. (1) We benchmark the outlier detection performance of 14 methods ranging from classical matrix factorization to the latest graph neural networks. (2) Using nine real datasets, our benchmark assesses how the different detection methods respond to two major types of synthetic outliers and separately to "organic" (real non-synthetic) outliers. (3) Using an existing random graph generation technique, we produce a family of synthetically generated datasets of different graph sizes that enable us to compare the running time and memory usage of the different outlier detection algorithms. Based on our experimental results, we discuss the pros and cons of existing graph outlier detection algorithms, and we highlight opportunities for future research. Importantly, our code is freely available and meant to be easily extendable: https://github.com/pygod-team/pygod/tree/main/benchmark △ Less

Submitted 15 October, 2022; v1 submitted 20 June, 2022; originally announced June 2022.

Comments: NeurIPS 2022. Benchmark available at https://github.com/pygod-team/pygod/tree/main/benchmark

arXiv:2206.03656 [pdf, other]

doi 10.3389/fdata.2022.1049565

Fair Classification via Domain Adaptation: A Dual Adversarial Learning Approach

Authors: Yueqing Liang, Canyu Chen, Tian Tian, Kai Shu

Abstract: Modern machine learning (ML) models are becoming increasingly popular and are widely used in decision-making systems. However, studies have shown critical issues of ML discrimination and unfairness, which hinder their adoption on high-stake applications. Recent research on fair classifiers has drawn significant attention to develo** effective algorithms to achieve fairness and good classificatio… ▽ More Modern machine learning (ML) models are becoming increasingly popular and are widely used in decision-making systems. However, studies have shown critical issues of ML discrimination and unfairness, which hinder their adoption on high-stake applications. Recent research on fair classifiers has drawn significant attention to develo** effective algorithms to achieve fairness and good classification performance. Despite the great success of these fairness-aware machine learning models, most of the existing models require sensitive attributes to pre-process the data, regularize the model learning or post-process the prediction to have fair predictions. However, sensitive attributes are often incomplete or even unavailable due to privacy, legal or regulation restrictions. Though we lack the sensitive attribute for training a fair model in the target domain, there might exist a similar domain that has sensitive attributes. Thus, it is important to exploit auxiliary information from a similar domain to help improve fair classification in the target domain. Therefore, in this paper, we study a novel problem of exploring domain adaptation for fair classification. We propose a new framework that can learn to adapt the sensitive attributes from a source domain for fair classification in the target domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model for fair classification, even when no sensitive attributes are available in the target domain. △ Less

Submitted 30 May, 2023; v1 submitted 7 June, 2022; originally announced June 2022.

arXiv:2205.12401 [pdf, other]

Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Authors: Xinran Liang, Katherine Shu, Kimin Lee, Pieter Abbeel

Abstract: Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model based on human preferences by actively incorporating human feedback, i.e. teacher's preferences between two clips of behaviors. However, poor feedback-efficiency still remains a problem in current preference-base… ▽ More Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL methods are able to learn a more flexible reward model based on human preferences by actively incorporating human feedback, i.e. teacher's preferences between two clips of behaviors. However, poor feedback-efficiency still remains a problem in current preference-based RL algorithms, as tailored human feedback is very expensive. To handle this issue, previous methods have mainly focused on improving query selection and policy initialization. At the same time, recent exploration methods have proven to be a recipe for improving sample-efficiency in RL. We present an exploration method specifically for preference-based RL algorithms. Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward. Specifically, we utilize disagreement across ensemble of learned reward models. Our intuition is that disagreement in learned reward model reflects uncertainty in tailored human feedback and could be useful for exploration. Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms on complex robot manipulation tasks from MetaWorld benchmarks, compared with other existing exploration methods that measure the novelty of state visitation. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: ICLR 2022. Last two authors advised equally

arXiv:2205.09229 [pdf, other]

PromptDA: Label-guided Data Augmentation for Prompt-based Few-shot Learners

Authors: Canyu Chen, Kai Shu

Abstract: Recent advances in large pre-trained language models (PLMs) lead to impressive gains in natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs heavily relies on sufficient labeled training instances, which are usually hard to obtain. Prompt-based tuning on PLMs has shown to be powerful for various downstream few-shot tasks. Existing works stud… ▽ More Recent advances in large pre-trained language models (PLMs) lead to impressive gains in natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs heavily relies on sufficient labeled training instances, which are usually hard to obtain. Prompt-based tuning on PLMs has shown to be powerful for various downstream few-shot tasks. Existing works studying prompt-based tuning for few-shot NLU tasks mainly focus on deriving proper label words with a verbalizer or generating prompt templates to elicit semantics from PLMs. In addition, conventional data augmentation strategies such as synonym substitution, though widely adopted in low-resource scenarios, only bring marginal improvements for prompt-based few-shot learning. Thus, an important research question arises: how to design effective data augmentation methods for prompt-based few-shot tuning? To this end, considering the label semantics are essential in prompt-based tuning, we propose a novel label-guided data augmentation framework PromptDA, which exploits the enriched label semantic information for data augmentation. Extensive experiment results on few-shot text classification tasks demonstrate the superior performance of the proposed framework by effectively leveraging label semantics and data augmentation for natural language understanding. Our code is available at https://github.com/canyuchen/PromptDA. △ Less

Submitted 22 March, 2023; v1 submitted 18 May, 2022; originally announced May 2022.

Comments: Accepted to Proceedings of EACL 2023 main conference. Code is available at https://github.com/canyuchen/PromptDA

arXiv:2205.03068 [pdf, other]

doi 10.1016/j.ipm.2022.102959

Characterizing Multi-Domain False News and Underlying User Effects on Chinese Weibo

Authors: Qiang Sheng, Juan Cao, H. Russell Bernard, Kai Shu, **tao Li, Huan Liu

Abstract: False news that spreads on social media has proliferated over the past years and has led to multi-aspect threats in the real world. While there are studies of false news on specific domains (like politics or health care), little work is found comparing false news across domains. In this article, we investigate false news across nine domains on Weibo, the largest Twitter-like social media platform… ▽ More False news that spreads on social media has proliferated over the past years and has led to multi-aspect threats in the real world. While there are studies of false news on specific domains (like politics or health care), little work is found comparing false news across domains. In this article, we investigate false news across nine domains on Weibo, the largest Twitter-like social media platform in China, from 2009 to 2019. The newly collected data comprise 44,728 posts in the nine domains, published by 40,215 users, and reposted over 3.4 million times. Based on the distributions and spreads of the multi-domain dataset, we observe that false news in domains that are close to daily life like health and medicine generated more posts but diffused less effectively than those in other domains like politics, and that political false news had the most effective capacity for diffusion. The widely diffused false news posts on Weibo were associated strongly with certain types of users -- by gender, age, etc. Further, these posts provoked strong emotions in the reposts and diffused further with the active engagement of false-news starters. Our findings have the potential to help design false news detection systems in suspicious news discovery, veracity prediction, and display and explanation. The comparison of the findings on Weibo with those of existing work demonstrates nuanced patterns, suggesting the need for more research on data from diverse platforms, countries, or languages to tackle the global issue of false news. The code and new anonymized dataset are available at https://github.com/ICTMCG/Characterizing-Weibo-Multi-Domain-False-News. △ Less

Submitted 21 June, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: 22 pages, 5 figures, and 12 tables. Accepted by Information Processing and Management (IP&M)

arXiv:2202.08159 [pdf, other]

doi 10.1145/3485447.3512258

Domain Adaptive Fake News Detection via Reinforcement Learning

Authors: Ahmadreza Mosallanezhad, Mansooreh Karami, Kai Shu, Michelle V. Mancenido, Huan Liu

Abstract: With social media being a major force in information consumption, accelerated propagation of fake news has presented new challenges for platforms to distinguish between legitimate and fake news. Effective fake news detection is a non-trivial task due to the diverse nature of news domains and expensive annotation costs. In this work, we address the limitations of existing automated fake news detect… ▽ More With social media being a major force in information consumption, accelerated propagation of fake news has presented new challenges for platforms to distinguish between legitimate and fake news. Effective fake news detection is a non-trivial task due to the diverse nature of news domains and expensive annotation costs. In this work, we address the limitations of existing automated fake news detection models by incorporating auxiliary information (e.g., user comments and user-news interactions) into a novel reinforcement learning-based model called \textbf{RE}inforced \textbf{A}daptive \textbf{L}earning \textbf{F}ake \textbf{N}ews \textbf{D}etection (REAL-FND). REAL-FND exploits cross-domain and within-domain knowledge that makes it robust in a target domain, despite being trained in a different source domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model, especially when limited labeled data is available in the target domain. △ Less

Submitted 16 February, 2022; originally announced February 2022.

Comments: Accepted to The ACM Web Conference (WWW) 2022

arXiv:2202.04752 [pdf, other]

doi 10.1145/3485447.3512264

"This is Fake! Shared it by Mistake": Assessing the Intent of Fake News Spreaders

Authors: Xinyi Zhou, Kai Shu, Vir V. Phoha, Huan Liu, Reza Zafarani

Abstract: Individuals can be misled by fake news and spread it unintentionally without knowing it is false. This phenomenon has been frequently observed but has not been investigated. Our aim in this work is to assess the intent of fake news spreaders. To distinguish between intentional versus unintentional spreading, we study the psychological explanations of unintentional spreading. With this foundation,… ▽ More Individuals can be misled by fake news and spread it unintentionally without knowing it is false. This phenomenon has been frequently observed but has not been investigated. Our aim in this work is to assess the intent of fake news spreaders. To distinguish between intentional versus unintentional spreading, we study the psychological explanations of unintentional spreading. With this foundation, we then propose an influence graph, using which we assess the intent of fake news spreaders. Our extensive experiments show that the assessed intent can help significantly differentiate between intentional and unintentional fake news spreaders. Furthermore, the estimated intent can significantly improve the current techniques that detect fake news. To our best knowledge, this is the first work to model individuals' intent in fake news spreading. △ Less

Submitted 13 February, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: Accepted to The ACM Web Conference (WWW) 2022

arXiv:2110.05644 [pdf, ps, other]

On the computational equivalence of co-NP refutations of a matrix being a P-matrix

Authors: Spencer Gordon, Kevin Shu

Abstract: A P-matrix is a square matrix $X$ such that all principal submatrices of $X$ have positive determinant. Such matrices appear naturally in instances of the linear complementarity problem, where these are precisely the matrices for which the corresponding linear complementarity problem has a unique solution for any input vector. Testing whether or not a square matrix is a P-matrix is co-NP compl… ▽ More A P-matrix is a square matrix $X$ such that all principal submatrices of $X$ have positive determinant. Such matrices appear naturally in instances of the linear complementarity problem, where these are precisely the matrices for which the corresponding linear complementarity problem has a unique solution for any input vector. Testing whether or not a square matrix is a P-matrix is co-NP complete, so while it is possible to exhibit polynomially-sized witnesses for the fact that a matrix is not a P-matrix, it is believed that there is no efficient way to prove that a given matrix is a P-matrix. We will show that several well known witnesses for the fact that a matrix is not a P-matrix are computationally equivalent, so that we are able to convert between them in polynomial time, answering a question raised in arXiv:1811.03841 . △ Less

Submitted 11 October, 2021; originally announced October 2021.

arXiv:2110.00911 [pdf, other]

Enhancing Model Robustness and Fairness with Causality: A Regularization Approach

Authors: Zhao Wang, Kai Shu, Aron Culotta

Abstract: Recent work has raised concerns on the risk of spurious correlations and unintended biases in statistical machine learning models that threaten model robustness and fairness. In this paper, we propose a simple and intuitive regularization approach to integrate causal knowledge during model training and build a robust and fair model by emphasizing causal features and de-emphasizing spurious feature… ▽ More Recent work has raised concerns on the risk of spurious correlations and unintended biases in statistical machine learning models that threaten model robustness and fairness. In this paper, we propose a simple and intuitive regularization approach to integrate causal knowledge during model training and build a robust and fair model by emphasizing causal features and de-emphasizing spurious features. Specifically, we first manually identify causal and spurious features with principles inspired from the counterfactual framework of causal inference. Then, we propose a regularization approach to penalize causal and spurious features separately. By adjusting the strength of the penalty for each type of feature, we build a predictive model that relies more on causal features and less on non-causal features. We conduct experiments to evaluate model robustness and fairness on three datasets with multiple metrics. Empirical results show that the new models built with causal awareness significantly improve model robustness with respect to counterfactual texts and model fairness with respect to sensitive attributes. △ Less

Submitted 2 October, 2021; originally announced October 2021.

arXiv:2110.00737 [pdf, other]

doi 10.1007/s13278-022-00921-9

A Survey of COVID-19 Misinformation: Datasets, Detection Techniques and Open Issues

Authors: A. R. Sana Ullah, Anupam Das, Anik Das, Muhammad Ashad Kabir, Kai Shu

Abstract: Misinformation during pandemic situations like COVID-19 is growing rapidly on social media and other platforms. This expeditious growth of misinformation creates adverse effects on the people living in the society. Researchers are trying their best to mitigate this problem using different approaches based on Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP). This sur… ▽ More Misinformation during pandemic situations like COVID-19 is growing rapidly on social media and other platforms. This expeditious growth of misinformation creates adverse effects on the people living in the society. Researchers are trying their best to mitigate this problem using different approaches based on Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP). This survey aims to study different approaches of misinformation detection on COVID-19 in recent literature to help the researchers in this domain. More specifically, we review the different methods used for COVID-19 misinformation detection in their research with an overview of data pre-processing and feature extraction methods to get a better understanding of their work. We also summarize the existing datasets which can be used for further research. Finally, we discuss the limitations of the existing methods and highlight some potential future research directions along this dimension to combat the spreading of misinformation during a pandemic. △ Less

Submitted 24 October, 2021; v1 submitted 2 October, 2021; originally announced October 2021.

Comments: 43 pages, 6 figures

Journal ref: Social Network Analysis and Mining, 2022

arXiv:2108.12603 [pdf, other]

WALNUT: A Benchmark on Semi-weakly Supervised Learning for Natural Language Understanding

Authors: Guoqing Zheng, Giannis Karamanolakis, Kai Shu, Ahmed Hassan Awadallah

Abstract: Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been proven valuable when large amount of labeled data is unavailable or expensive to obtain. Existing works studying weak supervision for NLU either mostly focus on a specific task or simulate weak supervision signals from ground-truth labels. It is thus hard to com… ▽ More Building machine learning models for natural language understanding (NLU) tasks relies heavily on labeled data. Weak supervision has been proven valuable when large amount of labeled data is unavailable or expensive to obtain. Existing works studying weak supervision for NLU either mostly focus on a specific task or simulate weak supervision signals from ground-truth labels. It is thus hard to compare different approaches and evaluate the benefit of weak supervision without access to a unified and systematic benchmark with diverse tasks and real-world weak labeling rules. In this paper, we propose such a benchmark, named WALNUT (semi-WeAkly supervised Learning for Natural language Understanding Testbed), to advocate and facilitate research on weak supervision for NLU. WALNUT consists of NLU tasks with different types, including document-level and token-level prediction tasks. WALNUT is the first semi-weakly supervised learning benchmark for NLU, where each task contains weak labels generated by multiple real-world weak sources, together with a small set of clean labels. We conduct baseline evaluations on WALNUT to systematically evaluate the effectiveness of various weak supervision methods and model architectures. Our results demonstrate the benefit of weak supervision for low-resource NLU tasks and highlight interesting patterns across tasks. We expect WALNUT to stimulate further research on methodologies to leverage weak supervision more effectively. The benchmark and code for baselines are available at \url{aka.ms/walnut_benchmark}. △ Less

Submitted 22 May, 2022; v1 submitted 28 August, 2021; originally announced August 2021.

Comments: Accepted to NAACL 2022 (Long Paper)

arXiv:2106.07564 [pdf]

An optimized Capsule-LSTM model for facial expression recognition with video sequences

Authors: Siwei Liu, Yuanpeng Long, Gao Xu, Lijia Yang, Shimei Xu, Xiaoming Yao, Kunxian Shu

Abstract: To overcome the limitations of convolutional neural network in the process of facial expression recognition, a facial expression recognition model Capsule-LSTM based on video frame sequence is proposed. This model is composed of three networks includingcapsule encoders, capsule decoders and LSTM network. The capsule encoder extracts the spatial information of facial expressions in video frames. Ca… ▽ More To overcome the limitations of convolutional neural network in the process of facial expression recognition, a facial expression recognition model Capsule-LSTM based on video frame sequence is proposed. This model is composed of three networks includingcapsule encoders, capsule decoders and LSTM network. The capsule encoder extracts the spatial information of facial expressions in video frames. Capsule decoder reconstructs the images to optimize the network. LSTM extracts the temporal information between video frames and analyzes the differences in expression changes between frames. The experimental results from the MMI dataset show that the Capsule-LSTM model proposed in this paper can effectively improve the accuracy of video expression recognition. △ Less

Submitted 27 May, 2021; originally announced June 2021.

Comments: 14pages,4 figurews

arXiv:2106.07563 [pdf]

BPLF: A Bi-Parallel Linear Flow Model for Facial Expression Generation from Emotion Set Images

Authors: Gao Xu, Yuanpeng Long, Siwei Liu, Lijia Yang, Shimei Xu, Xiaoming Yao, Kunxian Shu

Abstract: The flow-based generative model is a deep learning generative model, which obtains the ability to generate data by explicitly learning the data distribution. Theoretically its ability to restore data is stronger than other generative models. However, its implementation has many limitations, including limited model design, too many model parameters and tedious calculation. In this paper, a bi-paral… ▽ More The flow-based generative model is a deep learning generative model, which obtains the ability to generate data by explicitly learning the data distribution. Theoretically its ability to restore data is stronger than other generative models. However, its implementation has many limitations, including limited model design, too many model parameters and tedious calculation. In this paper, a bi-parallel linear flow model for facial emotion generation from emotion set images is constructed, and a series of improvements have been made in terms of the expression ability of the model and the convergence speed in training. The model is mainly composed of several coupling layers superimposed to form a multi-scale structure, in which each coupling layer contains 1*1 reversible convolution and linear operation modules. Furthermore, this paper sorted out the current public data set of facial emotion images, made a new emotion data, and verified the model through this data set. The experimental results show that, under the traditional convolutional neural network, the 3-layer 3*3 convolution kernel is more conducive to extracte the features of the face images. The introduction of principal component decomposition can improve the convergence speed of the model. △ Less

Submitted 27 May, 2021; originally announced June 2021.

Comments: 20 pages, 10 figures

arXiv:2106.04716 [pdf, other]

Labeled Data Generation with Inexact Supervision

Authors: Enyan Dai, Kai Shu, Yiwei Sun, Suhang Wang

Abstract: The recent advanced deep learning techniques have shown the promising results in various domains such as computer vision and natural language processing. The success of deep neural networks in supervised learning heavily relies on a large amount of labeled data. However, obtaining labeled data with target labels is often challenging due to various reasons such as cost of labeling and privacy issue… ▽ More The recent advanced deep learning techniques have shown the promising results in various domains such as computer vision and natural language processing. The success of deep neural networks in supervised learning heavily relies on a large amount of labeled data. However, obtaining labeled data with target labels is often challenging due to various reasons such as cost of labeling and privacy issues, which challenges existing deep models. In spite of that, it is relatively easy to obtain data with \textit{inexact supervision}, i.e., having labels/tags related to the target task. For example, social media platforms are overwhelmed with billions of posts and images with self-customized tags, which are not the exact labels for target classification tasks but are usually related to the target labels. It is promising to leverage these tags (inexact supervision) and their relations with target classes to generate labeled data to facilitate the downstream classification tasks. However, the work on this is rather limited. Therefore, we study a novel problem of labeled data generation with inexact supervision. We propose a novel generative framework named as ADDES which can synthesize high-quality labeled data for target classification tasks by learning from data with inexact supervision and the relations between inexact supervision and target classes. Experimental results on image and text datasets demonstrate the effectiveness of the proposed ADDES for generating realistic labeled data from inexact supervision to facilitate the target classification task. △ Less

Submitted 8 June, 2021; originally announced June 2021.

arXiv:2104.14537 [pdf, other]

doi 10.1145/3488560.3498493

Towards Fair Classifiers Without Sensitive Attributes: Exploring Biases in Related Features

Authors: Tianxiang Zhao, Enyan Dai, Kai Shu, Suhang Wang

Abstract: Despite the rapid development and great success of machine learning models, extensive studies have exposed their disadvantage of inheriting latent discrimination and societal bias from the training data. This phenomenon hinders their adoption on high-stake applications. Thus, many efforts have been taken for develo** fair machine learning models. Most of them require that sensitive attributes ar… ▽ More Despite the rapid development and great success of machine learning models, extensive studies have exposed their disadvantage of inheriting latent discrimination and societal bias from the training data. This phenomenon hinders their adoption on high-stake applications. Thus, many efforts have been taken for develo** fair machine learning models. Most of them require that sensitive attributes are available during training to learn fair models. However, in many real-world applications, it is usually infeasible to obtain the sensitive attributes due to privacy or legal issues, which challenges existing fair-ensuring strategies. Though the sensitive attribute of each data sample is unknown, we observe that there are usually some non-sensitive features in the training data that are highly correlated with sensitive attributes, which can be used to alleviate the bias. Therefore, in this paper, we study a novel problem of exploring features that are highly correlated with sensitive attributes for learning fair and accurate classifiers. We theoretically show that by minimizing the correlation between these related features and model prediction, we can learn a fair classifier. Based on this motivation, we propose a novel framework which simultaneously uses these related features for accurate prediction and enforces fairness. In addition, the model can dynamically adjust the regularization weight of each related feature to balance its contribution on model classification and fairness. Experimental results on real-world datasets demonstrate the effectiveness of the proposed model for learning fair models with high classification accuracy. △ Less

Submitted 28 December, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

Journal ref: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM 2022)

arXiv:2104.12259 [pdf, other]

User Preference-aware Fake News Detection

Authors: Yingtong Dou, Kai Shu, Congying Xia, Philip S. Yu, Lichao Sun

Abstract: Disinformation and fake news have posed detrimental effects on individuals and society in recent years, attracting broad attention to fake news detection. The majority of existing fake news detection algorithms focus on mining news content and/or the surrounding exogenous context for discovering deceptive signals; while the endogenous preference of a user when he/she decides to spread a piece of f… ▽ More Disinformation and fake news have posed detrimental effects on individuals and society in recent years, attracting broad attention to fake news detection. The majority of existing fake news detection algorithms focus on mining news content and/or the surrounding exogenous context for discovering deceptive signals; while the endogenous preference of a user when he/she decides to spread a piece of fake news or not is ignored. The confirmation bias theory has indicated that a user is more likely to spread a piece of fake news when it confirms his/her existing beliefs/preferences. Users' historical, social engagements such as posts provide rich information about users' preferences toward news and have great potential to advance fake news detection. However, the work on exploring user preference for fake news detection is somewhat limited. Therefore, in this paper, we study the novel problem of exploiting user preference for fake news detection. We propose a new framework, UPFD, which simultaneously captures various signals from user preferences by joint content and graph modeling. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework. We release our code and data as a benchmark for GNN-based fake news detection: https://github.com/safe-graph/GNN-FakeNews. △ Less

Submitted 25 April, 2021; originally announced April 2021.

Comments: Accepted by SIGIR'21. Code is available at https://github.com/safe-graph/GNN-FakeNews

arXiv:2103.02834 [pdf, ps, other]

Causal Channels

Authors: Kevin Shu

Abstract: We consider causal models with two observed variables and one latent variables, each variable being discrete, with the goal of characterizing the possible distributions on outcomes that can result from controlling one of the observed variables. We optimize linear functions over the space of all possible interventional distributions, which allows us find properties of the interventional distributio… ▽ More We consider causal models with two observed variables and one latent variables, each variable being discrete, with the goal of characterizing the possible distributions on outcomes that can result from controlling one of the observed variables. We optimize linear functions over the space of all possible interventional distributions, which allows us find properties of the interventional distribution even when we cannot uniquely identify what it is. We show that, under certain mild assumptions about the correlation between controlled variable and the latent variable, the resulting interventional distribution must be close to the observed conditional distribution in a quantitative sense. Specifically, we show that if the observed variables are sufficiently highly correlated, and the latent variable can only take on a small number of distinct values, then the variables will remain causally related after passing to the interventional distribution. Another result, possibly of more general interest, is a bound on the distance between the interventional distribution and the observed conditional distribution in terms of the mutual information between the controlled variable and the latent variable, which shows that the controlled variable and the latent variable must be tightly correlated for the interventional distribution to differ significantly from the observed distribution. We believe that this type of result may make it possible to rigorously consider 'weak' experiments, where the causal variable is not entirely independent from the environment, but only approximately so. More generally, we suggest a connection between the theory of causality to polynomial optimization, which give useful bounds on the space of interventional distributions. △ Less

Submitted 4 March, 2021; originally announced March 2021.

arXiv:2012.04778 [pdf, other]

Fact-Enhanced Synthetic News Generation

Authors: Kai Shu, Yichuan Li, Kaize Ding, Huan Liu

Abstract: The advanced text generation methods have witnessed great success in text summarization, language translation, and synthetic news generation. However, these techniques can be abused to generate disinformation and fake news. To better understand the potential threats of synthetic news, we develop a new generation method FactGen to generate high-quality news content. The existing text generation met… ▽ More The advanced text generation methods have witnessed great success in text summarization, language translation, and synthetic news generation. However, these techniques can be abused to generate disinformation and fake news. To better understand the potential threats of synthetic news, we develop a new generation method FactGen to generate high-quality news content. The existing text generation methods either afford limited supplementary information or lose consistency between the input and output which makes the synthetic news less trustworthy. To address these issues, FactGen retrieves external facts to enrich the output and reconstructs the input claim from the generated content to improve the consistency among the input and the output. Experiment results on real-world datasets show that the generated news contents of FactGen are consistent and contain rich facts. We also discuss the possible defending method to identify these synthetic news pieces if FactGen is used to generate synthetic news. △ Less

Submitted 12 December, 2020; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: AAAI 2021 Preprint Version

arXiv:2011.04088 [pdf, other]

MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation

Authors: Yichuan Li, Bohan Jiang, Kai Shu, Huan Liu

Abstract: The COVID-19 epidemic is considered as the global health crisis of the whole society and the greatest challenge mankind faced since World War Two. Unfortunately, the fake news about COVID-19 is spreading as fast as the virus itself. The incorrect health measurements, anxiety, and hate speeches will have bad consequences on people's physical health, as well as their mental health in the whole world… ▽ More The COVID-19 epidemic is considered as the global health crisis of the whole society and the greatest challenge mankind faced since World War Two. Unfortunately, the fake news about COVID-19 is spreading as fast as the virus itself. The incorrect health measurements, anxiety, and hate speeches will have bad consequences on people's physical health, as well as their mental health in the whole world. To help better combat the COVID-19 fake news, we propose a new fake news detection dataset MM-COVID(Multilingual and Multidimensional COVID-19 Fake News Data Repository). This dataset provides the multilingual fake news and the relevant social context. We collect 3981 pieces of fake news content and 7192 trustworthy information from English, Spanish, Portuguese, Hindi, French and Italian, 6 different languages. We present a detailed and exploratory analysis of MM-COVID from different perspectives and demonstrate the utility of MM-COVID in several potential applications of COVID-19 fake news study on multilingual and social media. △ Less

Submitted 23 November, 2020; v1 submitted 8 November, 2020; originally announced November 2020.

arXiv:2011.01579 [pdf, other]

Fake News Detection through Graph Comment Advanced Learning

Authors: Hao Liao, Qixin Liu, Kai Shu, Xing xie

Abstract: Disinformation has long been regarded as a severe social problem, where fake news is one of the most representative issues. What is worse, today's highly developed social media makes fake news widely spread at incredible speed, bringing in substantial harm to various aspects of human life. Yet, the popularity of social media also provides opportunities to better detect fake news. Unlike convention… ▽ More Disinformation has long been regarded as a severe social problem, where fake news is one of the most representative issues. What is worse, today's highly developed social media makes fake news widely spread at incredible speed, bringing in substantial harm to various aspects of human life. Yet, the popularity of social media also provides opportunities to better detect fake news. Unlike conventional means which merely focus on either content or user comments, effective collaboration of heterogeneous social media information, including content and context factors of news, users' comments and the engagement of social media with users, will hopefully give rise to better detection of fake news. Motivated by the above observations, a novel detection framework, namely graph comment-user advanced learning framework (GCAL) is proposed in this paper. User-comment information is crucial but not well studied in fake news detection. Thus, we model user-comment context through network representation learning based on heterogeneous graph neural network. We conduct experiments on two real-world datasets, which demonstrate that the proposed joint model outperforms 8 state-of-the-art baseline methods for fake news detection (at least 4% in Accuracy, 7% in Recall and 5% in F1). Moreover, the proposed method is also explainable. △ Less

Submitted 29 January, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

Showing 1–50 of 84 results for author: Shu, K